Donors choose data analysis

Little History about Data Set

Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount.

Answers to What and Why Questions on Data Set

DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.

Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

  • How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
  • How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
  • How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

About the DonorsChoose Data Set

The train.csv data set provided by DonorsChoose contains the following features:

Feature Description
project_id A unique identifier for the proposed project. Example: p036502
project_title Title of the project. Examples:
  • Art Will Make You Happy!
  • First Grade Fun
project_grade_category Grade level of students for which the project is targeted. One of the following enumerated values:
  • Grades PreK-2
  • Grades 3-5
  • Grades 6-8
  • Grades 9-12
project_subject_categories One or more (comma-separated) subject categories for the project from the following enumerated list of values:
  • Applied Learning
  • Care & Hunger
  • Health & Sports
  • History & Civics
  • Literacy & Language
  • Math & Science
  • Music & The Arts
  • Special Needs
  • Warmth

Examples:
  • Music & The Arts
  • Literacy & Language, Math & Science
school_state State where school is located (Two-letter U.S. postal code). Example: WY
project_subject_subcategories One or more (comma-separated) subject subcategories for the project. Examples:
  • Literacy
  • Literature & Writing, Social Sciences
project_resource_summary An explanation of the resources needed for the project. Example:
  • My students need hands on literacy materials to manage sensory needs!
project_essay_1 First application essay*
project_essay_2 Second application essay*
project_essay_3 Third application essay*
project_essay_4 Fourth application essay*
project_submitted_datetime Datetime when project application was submitted. Example: 2016-04-28 12:43:56.245
teacher_id A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56
teacher_prefix Teacher's title. One of the following enumerated values:
  • nan
  • Dr.
  • Mr.
  • Mrs.
  • Ms.
  • Teacher.
teacher_number_of_previously_posted_projects Number of project applications previously submitted by the same teacher. Example: 2

* See the section Notes on the Essay Data for more details about these features.

Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature Description
id A project_id value from the train.csv file. Example: p036502
description Desciption of the resource. Example: Tenor Saxophone Reeds, Box of 25
quantity Quantity of the resource required. Example: 3
price Price of the resource required. Example: 9.95

Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you use it as a key to retrieve all resources needed for a project:

The data set contains the following label (the value you will attempt to predict):

Label Description
project_is_approved A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved, and a value of 1 indicates the project was approved.

Notes on the Essay Data

    Prior to May 17, 2016, the prompts for the essays were as follows:
  • __project_essay_1:__ "Introduce us to your classroom"
  • __project_essay_2:__ "Tell us more about your students"
  • __project_essay_3:__ "Describe how your students will use the materials you're requesting"
  • __project_essay_4:__ "Close by sharing why your project will make a difference"
    Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:
  • __project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."
  • __project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"

  • For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.

Importing required libraries

In [1]:
# numpy for easy numerical computations
import numpy as np
# pandas for dataframes and filterings
import pandas as pd
# sqlite3 library for performing operations on sqlite file
import sqlite3
# matplotlib for plotting graphs
import matplotlib.pyplot as plt
# seaborn library for easy plotting
import seaborn as sbrn
# warnings library for specific settings
import warnings
# regularlanguage for regex operations
import re
# For loading precomputed models
import pickle
# For loading natural language processing tool-kit
import nltk

# For loading files from google drive
from google.colab import drive
# For working with files in google drive
drive.mount('/content/drive')
# tqdm for tracking progress of loops
from tqdm import tqdm_notebook as tqdm
# For creating dictionary of words
from collections import Counter
# For creating BagOfWords Model
from sklearn.feature_extraction.text import CountVectorizer
# For creating TfidfModel
from sklearn.feature_extraction.text import TfidfVectorizer
# For standardizing values
from sklearn.preprocessing import StandardScaler
# For merging sparse matrices along row direction
from scipy.sparse import hstack
# For merging sparse matrices along column direction
from scipy.sparse import vstack
# For calculating TSNE values
from sklearn.manifold import TSNE
# For calculating the accuracy score on cross validate data
from sklearn.metrics import accuracy_score
# For performing the k-fold cross validation
from sklearn.model_selection import cross_val_score
# For splitting the data set into test and train data
from sklearn import model_selection
# Support Vector classifier for classification
from sklearn.svm import SVC
# For reducing dimensions of data
from sklearn.decomposition import TruncatedSVD
# For using svm classifer - hinge loss function of sgd
from sklearn import linear_model
# For creating samples for making dataset balanced
from sklearn.utils import resample
# For shuffling the dataframes
from sklearn.utils import shuffle
# For calculating roc_curve parameters
from sklearn.metrics import roc_curve
# For calculating auc value
from sklearn.metrics import auc
# For displaying results in table format
from prettytable import PrettyTable
# For generating confusion matrix
from sklearn.metrics import confusion_matrix
# For using gridsearch cv to find best parameter
from sklearn.model_selection import GridSearchCV
# For performing min-max standardization to features
from sklearn.preprocessing import MinMaxScaler
# For calculating sentiment score of the text
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

warnings.filterwarnings('ignore')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
/usr/local/lib/python3.6/dist-packages/nltk/twitter/__init__.py:20: UserWarning: The twython library has not been installed. Some functionality from the twitter package will not be available.
  warnings.warn("The twython library has not been installed. "
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...

Reading and Storing Data

In [0]:
projectsData = pd.read_csv('drive/My Drive/train_data.csv');
resourcesData = pd.read_csv('drive/My Drive/resources.csv');
In [3]:
projectsData.head(3)
Out[3]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0
In [4]:
projectsData.tail(3)
Out[4]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved
109245 143653 p155633 cdbfd04aa041dc6739e9e576b1fb1478 Mrs. NJ 2016-08-25 17:11:32 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics 2016/2017 Beginning of the Year Basics This is a great group of sharing and caring st... My students learn about special events, holida... NaN NaN My students need giant comfy pillows in order ... 3 1
109246 164599 p206114 6d5675dbfafa1371f0e2f6f1b716fe2d Mrs. NY 2016-07-29 17:53:15 Grades 3-5 Health & Sports, Special Needs Health & Wellness, Special Needs Flexible Seating in Inclusive Classroom Our students live in a small rural community. ... Flexible classroom seating has been researched... NaN NaN My students need flexible seating options: bea... 0 1
109247 128381 p191189 ca25d5573f2bd2660f7850a886395927 Ms. VA 2016-06-29 09:17:01 Grades 6-8 Applied Learning, Math & Science College & Career Prep, Mathematics Classroom Tech to Develop 21st Century Leaders When was the last time that you used math? Pro... According to Forbes Magazine (2014), companies... NaN NaN My students need opportunities to work with te... 0 1
In [5]:
resourcesData.head(3)
Out[5]:
id description quantity price
0 p233245 LC652 - Lakeshore Double-Space Mobile Drying Rack 1 149.00
1 p069063 Bouncy Bands for Desks (Blue support pipes) 3 14.95
2 p069063 Cory Stories: A Kid's Book About Living With Adhd 1 8.45
In [6]:
resourcesData.tail(3)
Out[6]:
id description quantity price
1541269 p031981 Black Electrical Tape (GIANT 3 PACK) Each Roll... 6 8.99
1541270 p031981 Flormoon DC Motor Mini Electric Motor 0.5-3V 1... 2 8.14
1541271 p031981 WAYLLSHINE 6PCS 2 x 1.5V AAA Battery Spring Cl... 2 7.39

Helper functions and classes

In [0]:
def equalsBorder(numberOfEqualSigns):
    """
    This function prints passed number of equal signs
    """
    print("="* numberOfEqualSigns);
In [0]:
# Citation link: https://stackoverflow.com/questions/8924173/how-do-i-print-bold-text-in-python
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'
In [0]:
def printStyle(text, style):
    "This function prints text with the style passed to it"
    print(style + text + color.END);

Shapes of projects data and resources data

In [10]:
printStyle("Number of data points in projects data: {}".format(projectsData.shape[0]), color.BOLD);
printStyle("Number of attributes in projects data:{}".format(projectsData.shape[1]), color.BOLD);
equalsBorder(60);
printStyle("Number of data points in resources data: {}".format(resourcesData.shape[0]), color.BOLD);
printStyle("Number of attributes in resources data: {}".format(resourcesData.shape[1]), color.BOLD);
Number of data points in projects data: 109248
Number of attributes in projects data:17
============================================================
Number of data points in resources data: 1541272
Number of attributes in resources data: 4

Univariate data analysis

In [11]:
approvedProjects = projectsData[projectsData.project_is_approved == 1].shape[0];
unApprovedProjects = projectsData[projectsData.project_is_approved == 0].shape[0];
totalProjects = projectsData.shape[0];
print("Number of projects approved for funding: {}, ({})".format(approvedProjects, (approvedProjects / totalProjects) * 100));
print("Number of projects not approved for funding: {}, ({})".format(unApprovedProjects, (unApprovedProjects / totalProjects) * 100));
# Pie chart representation
# Citation: https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html
labels = ["Approved Projects", "UnApproved Projects"];
explode = (0, 0.1);
sizes = [approvedProjects, unApprovedProjects];
figure, ax = plt.subplots();
ax.pie(sizes, labels = labels, explode = explode, autopct = '%1.1f%%', shadow = True, startangle = 90);
ax.axis('equal');
plt.rcParams['figure.figsize'] = (7, 7);
plt.show();
Number of projects approved for funding: 92706, (84.85830404217927)
Number of projects not approved for funding: 16542, (15.141695957820739)

Observation:

  1. There are more number of approved projects compared to rejected projects. So this is a imbalanced dataset.

Univariate Analysis : 'school_state'

Project proposal percentage in different states

In [12]:
groupedByStatesData = pd.DataFrame(projectsData.groupby(['school_state'])['project_is_approved'].apply(np.mean)).reset_index();
groupedByStatesData.columns = ['state_code', 'number_of_proposals'];
groupedByStatesData = groupedByStatesData.sort_values(by=['number_of_proposals'], ascending = True);
printStyle("5 States with lowest percentage of project approvals:", color.BOLD);
equalsBorder(60);
groupedByStatesData.head(5)
5 States with lowest percentage of project approvals:
============================================================
Out[12]:
state_code number_of_proposals
46 VT 0.800000
7 DC 0.802326
43 TX 0.813142
26 MT 0.816327
18 LA 0.831245
In [13]:
printStyle("5 states with highest percentage of project approvals: ", color.BOLD);
equalsBorder(60);
groupedByStatesData.tail(5).iloc[::-1]
5 states with highest percentage of project approvals: 
============================================================
Out[13]:
state_code number_of_proposals
8 DE 0.897959
28 ND 0.888112
47 WA 0.876178
35 OH 0.875152
30 NH 0.873563
In [0]:
def univariateBarPlots(data, col1, col2 = 'project_is_approved', orientation = 'vertical', plot = True):
    groupedData = data.groupby(col1);
    # Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
    tempData = pd.DataFrame(groupedData[col2].agg(lambda x: x.eq(1).sum())).reset_index();
    tempData['total'] = pd.DataFrame(groupedData[col2].agg({'total': 'count'})).reset_index()['total'];
    tempData['approval_rate'] = pd.DataFrame(groupedData[col2].agg({'approval_rate': 'mean'})).reset_index()['approval_rate'];
    tempData.sort_values(by=['total'], inplace = True, ascending = False);
    tempDataWithTotalAndCol2 = tempData[['total', col2, col1]]
    if plot:
        if(orientation == 'vertical'):
            tempDataWithTotalAndCol2.plot(x = col1, align= 'center', kind = 'bar', title = "Number of projects approved vs rejected", figsize = (20, 6), stacked = True, rot = 0);
        else:
            tempDataWithTotalAndCol2.plot(x = col1, align= 'center', kind = 'barh', title = "Number of projects approved vs rejected", width = 0.8, figsize = (23, 20), stacked = True);
    return tempData;
In [15]:
statesCharacteristicsData = univariateBarPlots(projectsData, 'school_state', 'project_is_approved', orientation = 'vertical');
printStyle("Top 5 states with high project proposals", color.BOLD)
equalsBorder(60);
statesCharacteristicsData.head(5)
Top 5 states with high project proposals
============================================================
Out[15]:
school_state project_is_approved total approval_rate
4 CA 13205 15388 0.858136
43 TX 6014 7396 0.813142
34 NY 6291 7318 0.859661
9 FL 5144 6185 0.831690
27 NC 4353 5091 0.855038
In [16]:
printStyle("Top 5 states with least project proposals", color.BOLD)
equalsBorder(60);
statesCharacteristicsData.tail(5)
Top 5 states with least project proposals
============================================================
Out[16]:
school_state project_is_approved total approval_rate
39 RI 243 285 0.852632
26 MT 200 245 0.816327
28 ND 127 143 0.888112
50 WY 82 98 0.836735
46 VT 64 80 0.800000

Observation:

  1. Highest number of project proposals are from CA(California) and it was almost about 16000 projects
  2. Every state has more than 80% approval rate.

Univariate Analysis: teacher_prefix

In [17]:
teacherPrefixCharacteristicsData = univariateBarPlots(projectsData, 'teacher_prefix', 'project_is_approved', orientation = 'vertical', plot = True);
printStyle("Project proposals characteristics based on types of persons", color.BOLD);
equalsBorder(60);
teacherPrefixCharacteristicsData
Project proposals characteristics based on types of persons
============================================================
Out[17]:
teacher_prefix project_is_approved total approval_rate
2 Mrs. 48997 57269 0.855559
3 Ms. 32860 38955 0.843537
1 Mr. 8960 10648 0.841473
4 Teacher 1877 2360 0.795339
0 Dr. 9 13 0.692308

Observataion:

  1. When compared to others Dr.'s have proposed very less number of projects.
  2. Women have proposed more number of projects than men.

Univariate Analysis: project_grade

In [18]:
gradeCharacteristicsData = univariateBarPlots(projectsData, 'project_grade_category', 'project_is_approved', orientation = 'vertical', plot = True);
printStyle("Project proposal characteristics based on grades", color.BOLD);
equalsBorder(60);
gradeCharacteristicsData
Project proposal characteristics based on grades
============================================================
Out[18]:
project_grade_category project_is_approved total approval_rate
3 Grades PreK-2 37536 44225 0.848751
0 Grades 3-5 31729 37137 0.854377
1 Grades 6-8 14258 16923 0.842522
2 Grades 9-12 9183 10963 0.837636

Observation:

  1. Most number of projects proposed are for students less than grade-5 (for primary school students) which means that children are being taught with project oriented teaching which is great.

Univariate Analysis: project_subject_categories

In [0]:
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
def cleanCategories(subjectCategories):
    cleanedCategories = []
    for subjectCategory in tqdm(subjectCategories):
        tempCategory = ""
        for category in subjectCategory.split(","):
            if 'The' in category.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
                category = category.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
            category = category.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
            tempCategory += category.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
            tempCategory = tempCategory.replace('&','_')
        cleanedCategories.append(tempCategory)
    return cleanedCategories
In [20]:
# projectDataWithCleanedCategories = pd.DataFrame(projectsData);
subjectCategories = list(projectsData.project_subject_categories);
cleanedCategories = cleanCategories(subjectCategories);
printStyle("Sample categories: ", color.BOLD);
equalsBorder(60);
print(subjectCategories[0:5]);
equalsBorder(60);
printStyle("Sample cleaned categories: ", color.BOLD);
equalsBorder(60);
print(cleanedCategories[0:5]);
projectsData['cleaned_categories'] = cleanedCategories;
projectsData.head(5)
Sample categories: 
============================================================
['Literacy & Language', 'History & Civics, Health & Sports', 'Health & Sports', 'Literacy & Language, Math & Science', 'Math & Science']
============================================================
Sample cleaned categories: 
============================================================
['Literacy_Language ', 'History_Civics Health_Sports ', 'Health_Sports ', 'Literacy_Language Math_Science ', 'Math_Science ']
Out[20]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs. KY 2016-10-06 21:16:17 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners I work at a unique school filled with both ESL... My students live in high poverty conditions wi... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science
4 172407 p104768 be1f7507a41f8479dc06f047086a39ec Mrs. TX 2016-07-11 01:10:09 Grades PreK-2 Math & Science Mathematics Interactive Math Tools Our second grade classroom next year will be m... For many students, math is a subject that does... NaN NaN My students need hands on practice in mathemat... 1 1 Math_Science
In [21]:
categoriesCharacteristicsData = univariateBarPlots(projectsData, 'cleaned_categories', 'project_is_approved', orientation = 'horizontal', plot = True);
print("Project proposals characteristics based on subject categories");
equalsBorder(60);
categoriesCharacteristicsData.head(5)
Project proposals characteristics based on subject categories
============================================================
Out[21]:
cleaned_categories project_is_approved total approval_rate
24 Literacy_Language 20520 23655 0.867470
32 Math_Science 13991 17072 0.819529
28 Literacy_Language Math_Science 12725 14636 0.869432
8 Health_Sports 8640 10177 0.848973
40 Music_Arts 4429 5180 0.855019
In [22]:
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesCounter = Counter()
for subjectCategory in projectsData.cleaned_categories.values:
    categoriesCounter.update(subjectCategory.split());
categoriesCounter
Out[22]:
Counter({'AppliedLearning': 12135,
         'Care_Hunger': 1388,
         'Health_Sports': 14223,
         'History_Civics': 5914,
         'Literacy_Language': 52239,
         'Math_Science': 41421,
         'Music_Arts': 10293,
         'SpecialNeeds': 13642,
         'Warmth': 1388})
In [23]:
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesDictionary = dict(categoriesCounter);
sortedCategoriesDictionary = dict(sorted(categoriesDictionary.items(), key = lambda keyValue: keyValue[1]));
sortedCategoriesData = pd.DataFrame.from_dict(sortedCategoriesDictionary, orient='index');
sortedCategoriesData.columns = ['subject_categories'];
printStyle("Number of projects by Subject Categories: ", color.BOLD);
equalsBorder(60);
sortedCategoriesData
Number of projects by Subject Categories: 
============================================================
Out[23]:
subject_categories
Warmth 1388
Care_Hunger 1388
History_Civics 5914
Music_Arts 10293
AppliedLearning 12135
SpecialNeeds 13642
Health_Sports 14223
Math_Science 41421
Literacy_Language 52239
In [24]:
sortedCategoriesData.plot(kind = 'bar', title = 'Number of projects by subject categories');

Observation:

  1. Many number of projects proposed belong to multiple subject categories.
  2. When compared to others literacy_language & math_science have large number of project proposals.

Univariate Analysis: project_subject_subcategories

In [25]:
subjectSubCategories = projectsData.project_subject_subcategories;
cleanedSubCategories = cleanCategories(subjectSubCategories);
printStyle("Sample subject sub categories: ", color.BOLD);
equalsBorder(70);
print(subjectSubCategories[0:5]);
equalsBorder(70);
printStyle("Sample cleaned subject sub categories: ", color.BOLD);
equalsBorder(70);
print(cleanedSubCategories[0:5]);
projectsData['cleaned_sub_categories'] = cleanedSubCategories;
Sample subject sub categories: 
======================================================================
0                       ESL, Literacy
1    Civics & Government, Team Sports
2      Health & Wellness, Team Sports
3               Literacy, Mathematics
4                         Mathematics
Name: project_subject_subcategories, dtype: object
======================================================================
Sample cleaned subject sub categories: 
======================================================================
['ESL Literacy ', 'Civics_Government TeamSports ', 'Health_Wellness TeamSports ', 'Literacy Mathematics ', 'Mathematics ']
In [26]:
projectsData.head(5)
Out[26]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs. KY 2016-10-06 21:16:17 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners I work at a unique school filled with both ESL... My students live in high poverty conditions wi... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science Literacy Mathematics
4 172407 p104768 be1f7507a41f8479dc06f047086a39ec Mrs. TX 2016-07-11 01:10:09 Grades PreK-2 Math & Science Mathematics Interactive Math Tools Our second grade classroom next year will be m... For many students, math is a subject that does... NaN NaN My students need hands on practice in mathemat... 1 1 Math_Science Mathematics
In [27]:
subCategoriesCharacteristicsData = univariateBarPlots(projectsData, 'cleaned_sub_categories', 'project_is_approved', plot = False);
print("Project proposals characteristics based on subject sub categories");
equalsBorder(60);
subCategoriesCharacteristicsData.head(5)
Project proposals characteristics based on subject sub categories
============================================================
Out[27]:
cleaned_sub_categories project_is_approved total approval_rate
317 Literacy 8371 9486 0.882458
319 Literacy Mathematics 7260 8325 0.872072
331 Literature_Writing Mathematics 5140 5923 0.867803
318 Literacy Literature_Writing 4823 5571 0.865733
342 Mathematics 4385 5379 0.815207
In [28]:
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
subjectsSubCategoriesCounter = Counter();
for subCategory in projectsData.cleaned_sub_categories:
    subjectsSubCategoriesCounter.update(subCategory.split());
subjectsSubCategoriesCounter
Out[28]:
Counter({'AppliedSciences': 10816,
         'Care_Hunger': 1388,
         'CharacterEducation': 2065,
         'Civics_Government': 815,
         'College_CareerPrep': 2568,
         'CommunityService': 441,
         'ESL': 4367,
         'EarlyDevelopment': 4254,
         'Economics': 269,
         'EnvironmentalScience': 5591,
         'Extracurricular': 810,
         'FinancialLiteracy': 568,
         'ForeignLanguages': 890,
         'Gym_Fitness': 4509,
         'Health_LifeScience': 4235,
         'Health_Wellness': 10234,
         'History_Geography': 3171,
         'Literacy': 33700,
         'Literature_Writing': 22179,
         'Mathematics': 28074,
         'Music': 3145,
         'NutritionEducation': 1355,
         'Other': 2372,
         'ParentInvolvement': 677,
         'PerformingArts': 1961,
         'SocialSciences': 1920,
         'SpecialNeeds': 13642,
         'TeamSports': 2192,
         'VisualArts': 6278,
         'Warmth': 1388})
In [29]:
# dict sort by value python: https://stackoverflow.com/a/613218/4084039
dictionarySubCategories = dict(subjectsSubCategoriesCounter);
sortedDictionarySubCategories = dict(sorted(dictionarySubCategories.items(), key = lambda keyValue: keyValue[1]));
sortedSubCategoriesData = pd.DataFrame.from_dict(sortedDictionarySubCategories, orient = 'index');
sortedSubCategoriesData.columns = ['subject_sub_categories']
sortedSubCategoriesData.plot(kind = 'bar', title = "Number of projects by subject sub categories");
printStyle("Number of projects sorted by subject sub categories: ", color.BOLD);
equalsBorder(70);
sortedSubCategoriesData
Number of projects sorted by subject sub categories: 
======================================================================
Out[29]:
subject_sub_categories
Economics 269
CommunityService 441
FinancialLiteracy 568
ParentInvolvement 677
Extracurricular 810
Civics_Government 815
ForeignLanguages 890
NutritionEducation 1355
Warmth 1388
Care_Hunger 1388
SocialSciences 1920
PerformingArts 1961
CharacterEducation 2065
TeamSports 2192
Other 2372
College_CareerPrep 2568
Music 3145
History_Geography 3171
Health_LifeScience 4235
EarlyDevelopment 4254
ESL 4367
Gym_Fitness 4509
EnvironmentalScience 5591
VisualArts 6278
Health_Wellness 10234
AppliedSciences 10816
SpecialNeeds 13642
Literature_Writing 22179
Mathematics 28074
Literacy 33700

Observation:

  1. There are more number of subject subcategories than subject categories.
  2. Even more number of projects proposed belong to multiple subject sub categories.

Univariate Analysis : project_title

In [30]:
#How to calculate number of words in a string in DataFrame: https://stackoverflow.com/a/37483537/4084039
wordCounts = projectsData['project_title'].str.split().apply(len).value_counts();
dictionaryWordCounts = dict(wordCounts);
dictionaryWordCounts = dict(sorted(dictionaryWordCounts.items(), key = lambda kv: kv[1]));
wordCountsData = pd.DataFrame.from_dict({'number_of_words': list(dictionaryWordCounts.keys()), 'number_of_projects': list(dictionaryWordCounts.values())}).sort_values(by = ['number_of_projects']);
wordCountsData.plot(kind = 'bar', title = "Number of projects vs Number of words in project title", legend = False);
plt.xlabel('Number of words');
plt.ylabel('Number of projects');
wordCountsData
Out[30]:
number_of_words number_of_projects
0 13 1
1 12 11
2 11 30
3 1 31
4 10 3968
5 9 5383
6 8 7289
7 2 8733
8 7 10631
9 6 14824
10 3 18691
11 5 19677
12 4 19979
In [31]:
approvedNumberOfProjects = projectsData[projectsData.project_is_approved == 1]['project_title'].str.split().apply(len);
approvedNumberOfProjects = approvedNumberOfProjects.values
unApprovedNumberOfProjects = projectsData[projectsData.project_is_approved == 0]['project_title'].str.split().apply(len);
unApprovedNumberOfProjects = unApprovedNumberOfProjects.values
plt.boxplot([approvedNumberOfProjects, unApprovedNumberOfProjects]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Number of words in title');
plt.show();
In [32]:
plt.figure(figsize = (10, 6));
sbrn.kdeplot(approvedNumberOfProjects, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedNumberOfProjects, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();

Observations:

  1. Most of the approved projects have between 4 to 8 number of words in their project_title.
  2. Most of the rejected projects have between 3 to 6 number of words in their project_title.

Univariate Analysis: project_essay_1,2,3,4

In [33]:
projectsData['project_essay'] = projectsData['project_essay_1'].map(str) + projectsData['project_essay_2'].map(str) + \
                                projectsData['project_essay_3'].map(str) + projectsData['project_essay_4'].map(str);
projectsData.head(5)
Out[33]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work...
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea...
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th...
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs. KY 2016-10-06 21:16:17 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners I work at a unique school filled with both ESL... My students live in high poverty conditions wi... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science Literacy Mathematics I work at a unique school filled with both ESL...
4 172407 p104768 be1f7507a41f8479dc06f047086a39ec Mrs. TX 2016-07-11 01:10:09 Grades PreK-2 Math & Science Mathematics Interactive Math Tools Our second grade classroom next year will be m... For many students, math is a subject that does... NaN NaN My students need hands on practice in mathemat... 1 1 Math_Science Mathematics Our second grade classroom next year will be m...
In [34]:
approvedNumberOfProjects = projectsData[projectsData.project_is_approved == 1]['project_essay'].str.split().apply(len);
approvedNumberOfProjects = approvedNumberOfProjects.values
unApprovedNumberOfProjects = projectsData[projectsData.project_is_approved == 0]['project_essay'].str.split().apply(len);
unApprovedNumberOfProjects = unApprovedNumberOfProjects.values
plt.boxplot([approvedNumberOfProjects, unApprovedNumberOfProjects]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Number of words in project essay');
plt.show();
In [35]:
plt.figure(figsize = (10, 6));
sbrn.kdeplot(approvedNumberOfProjects, label = "Approved Projects", bw = 5);
sbrn.kdeplot(unApprovedNumberOfProjects, label = "UnApproved Projects", bw = 5);
plt.legend();
plt.show();

Observation:

  1. The approved and rejected projects overlap largely when plotted based on number of words in project_essay. So we cannot predict any observation which will be useful for classification.

Univariate Analysis: price

In [36]:
projectsData.head(5)
Out[36]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work...
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea...
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th...
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs. KY 2016-10-06 21:16:17 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners I work at a unique school filled with both ESL... My students live in high poverty conditions wi... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science Literacy Mathematics I work at a unique school filled with both ESL...
4 172407 p104768 be1f7507a41f8479dc06f047086a39ec Mrs. TX 2016-07-11 01:10:09 Grades PreK-2 Math & Science Mathematics Interactive Math Tools Our second grade classroom next year will be m... For many students, math is a subject that does... NaN NaN My students need hands on practice in mathemat... 1 1 Math_Science Mathematics Our second grade classroom next year will be m...
In [37]:
resourcesData.head(5)
Out[37]:
id description quantity price
0 p233245 LC652 - Lakeshore Double-Space Mobile Drying Rack 1 149.00
1 p069063 Bouncy Bands for Desks (Blue support pipes) 3 14.95
2 p069063 Cory Stories: A Kid's Book About Living With Adhd 1 8.45
3 p069063 Dixon Ticonderoga Wood-Cased #2 HB Pencils, Bo... 2 13.59
4 p069063 EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS... 3 24.95
In [38]:
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
priceAndQuantityData = resourcesData.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index();
priceAndQuantityData.head(5)
Out[38]:
id price quantity
0 p000001 459.56 7
1 p000002 515.89 21
2 p000003 298.97 4
3 p000004 1113.69 98
4 p000005 485.99 8
In [39]:
projectsData.shape
Out[39]:
(109248, 20)
In [40]:
projectsData = pd.merge(projectsData, priceAndQuantityData, on = 'id', how = 'left');
print(projectsData.shape);
projectsData.head(3)
(109248, 22)
Out[40]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work... 154.60 23
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea... 299.00 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th... 516.85 22
In [41]:
projectsData[projectsData['id'] == 'p253737']
Out[41]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work... 154.6 23
In [42]:
priceAndQuantityData[priceAndQuantityData['id'] == 'p253737']
Out[42]:
id price quantity
253736 p253737 154.6 23
In [43]:
approvedProjectsPrice = projectsData[projectsData['project_is_approved'] == 1].price;
unApprovedProjectsPrice = projectsData[projectsData['project_is_approved'] == 0].price;
plt.boxplot([approvedProjectsPrice, unApprovedProjectsPrice]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Cost per project');
plt.show();
In [44]:
plt.title("Kde plot based on cost per project");
sbrn.kdeplot(approvedProjectsPrice, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedProjectsPrice, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();
In [45]:
pricePercentilesApproved = [round(np.percentile(approvedProjectsPrice, percentile), 3) for percentile in np.arange(0, 100, 5)];
pricePercentilesUnApproved = [round(np.percentile(unApprovedProjectsPrice, percentile), 3) for percentile in np.arange(0, 100, 5)];
percentileValuePricesData = pd.DataFrame({'Percentile': np.arange(0, 100, 5), 'Approved projects': pricePercentilesApproved, 'UnApproved Projects': pricePercentilesUnApproved});
percentileValuePricesData
Out[45]:
Percentile Approved projects UnApproved Projects
0 0 0.660 1.970
1 5 13.590 41.900
2 10 33.880 73.670
3 15 58.000 99.109
4 20 77.380 118.560
5 25 99.950 140.892
6 30 116.680 162.230
7 35 137.232 184.014
8 40 157.000 208.632
9 45 178.265 235.106
10 50 198.990 263.145
11 55 223.990 292.610
12 60 255.630 325.144
13 65 285.412 362.390
14 70 321.225 399.990
15 75 366.075 449.945
16 80 411.670 519.282
17 85 479.000 618.276
18 90 593.110 739.356
19 95 801.598 992.486

Observation:

  1. Most of the projects proposed are of less cost.
In [46]:
approvedProjectsQuantity = projectsData[projectsData['project_is_approved'] == 1].quantity;
unApprovedProjectsQuantity = projectsData[projectsData['project_is_approved'] == 0].quantity;
plt.boxplot([approvedProjectsQuantity, unApprovedProjectsQuantity]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Quantity of resources per project');
plt.show();
In [47]:
plt.title("Kde plot based on quantity of resources per project");
sbrn.kdeplot(approvedProjectsQuantity, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedProjectsQuantity, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();
In [48]:
quantityPercentilesApproved = [round(np.percentile(approvedProjectsQuantity, percentile), 3) for percentile in np.arange(0, 100, 5)];
quantityPercentilesUnApproved = [round(np.percentile(unApprovedProjectsQuantity, percentile), 3) for percentile in np.arange(0, 100, 5)];
percentileValueQuantitiesData = pd.DataFrame({'Percentile': np.arange(0, 100, 5), 'Approved projects': quantityPercentilesApproved, 'UnApproved Projects': quantityPercentilesUnApproved});
percentileValueQuantitiesData
Out[48]:
Percentile Approved projects UnApproved Projects
0 0 1.0 1.0
1 5 1.0 2.0
2 10 1.0 3.0
3 15 2.0 4.0
4 20 3.0 5.0
5 25 3.0 6.0
6 30 4.0 7.0
7 35 5.0 8.0
8 40 6.0 9.0
9 45 7.0 10.0
10 50 8.0 12.0
11 55 10.0 13.0
12 60 11.0 15.0
13 65 14.0 18.0
14 70 16.0 20.0
15 75 20.0 24.0
16 80 25.0 29.0
17 85 30.0 35.0
18 90 38.0 45.0
19 95 56.0 63.0
In [49]:
sbrn.set_style('whitegrid');
sbrn.FacetGrid(projectsData, hue = 'project_is_approved', size = 6) \
    .map(plt.scatter, 'price', 'quantity') \
    .add_legend();
plt.title("Scatter plot between price and quantity based project approval and rejection");
plt.show();

Observation:

  1. When plotted scatter plot between approved and rejected projects based on price and quantity there is huge overlap. So the projects approval is not actually depending on price and quantity resources of the project.

Univariate Analysis: teacher_number_of_previously_posted_projects

In [50]:
projectsData.head(5)
Out[50]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work... 154.60 23
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea... 299.00 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th... 516.85 22
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs. KY 2016-10-06 21:16:17 Grades PreK-2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners I work at a unique school filled with both ESL... My students live in high poverty conditions wi... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science Literacy Mathematics I work at a unique school filled with both ESL... 232.90 4
4 172407 p104768 be1f7507a41f8479dc06f047086a39ec Mrs. TX 2016-07-11 01:10:09 Grades PreK-2 Math & Science Mathematics Interactive Math Tools Our second grade classroom next year will be m... For many students, math is a subject that does... NaN NaN My students need hands on practice in mathemat... 1 1 Math_Science Mathematics Our second grade classroom next year will be m... 67.98 4
In [51]:
previouslyPostedApprovedNumberData = projectsData.groupby('teacher_number_of_previously_posted_projects')['project_is_approved'].agg(lambda x: x.eq(1).sum()).reset_index();
previouslyPostedRejectedNumberData = projectsData.groupby('teacher_number_of_previously_posted_projects')['project_is_approved'].agg(lambda x: x.eq(0).sum()).reset_index();
print("Total number of projects approved: ", len(projectsData[projectsData['project_is_approved'] == 1]));
print("Total number of projects rejected: ", len(projectsData[projectsData['project_is_approved'] == 0]));
print("Number of projects approved categorized by previously_posted: ", previouslyPostedApprovedNumberData['project_is_approved'].sum());
print("Number of projects rejected categorized by previously_posted: ", previouslyPostedRejectedNumberData['project_is_approved'].sum());
previouslyPostedNumberData = pd.merge(previouslyPostedApprovedNumberData, previouslyPostedRejectedNumberData, on = 'teacher_number_of_previously_posted_projects', how = 'inner');
previouslyPostedNumberData.head(5)
Total number of projects approved:  92706
Total number of projects rejected:  16542
Number of projects approved categorized by previously_posted:  92706
Number of projects rejected categorized by previously_posted:  16542
Out[51]:
teacher_number_of_previously_posted_projects project_is_approved_x project_is_approved_y
0 0 24652 5362
1 1 13329 2729
2 2 8705 1645
3 3 5997 1113
4 4 4452 814
In [52]:
plt.figure(figsize = (20, 8));
plt.bar(previouslyPostedNumberData.teacher_number_of_previously_posted_projects, previouslyPostedNumberData.project_is_approved_x);
plt.bar(previouslyPostedNumberData.teacher_number_of_previously_posted_projects, previouslyPostedNumberData.project_is_approved_y);
plt.show();
In [53]:
previouslyPostedApprovedData = projectsData[projectsData['project_is_approved'] == 1].teacher_number_of_previously_posted_projects;
previouslyPostedRejectedData = projectsData[projectsData['project_is_approved'] == 0].teacher_number_of_previously_posted_projects;
plt.boxplot([previouslyPostedApprovedData, previouslyPostedRejectedData]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'Rejected Projects']);
plt.ylabel('Previously posted number of projects');
plt.show();
In [54]:
sbrn.kdeplot(previouslyPostedApprovedData, label = "Approved projects", bw = 1);
sbrn.kdeplot(previouslyPostedRejectedData, label = "Rejected projects", bw = 1);
plt.show();

Observation:

  1. Most of the projects approved and rejected are with less number of teacher_number_of_previously_posted_projects. So the approval is not much depending on how many number of projects proposed by teacher previously.
In [0]:
def stringContainsNumbers(string):
    return any([character.isdigit() for character in string])
In [56]:
numericResourceApprovedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == True) & (projectsData['project_is_approved'] == 1)]
textResourceApprovedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == False) & (projectsData['project_is_approved'] == 1)]
numericResourceRejectedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == True) & (projectsData['project_is_approved'] == 0)]
textResourceRejectedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == False) & (projectsData['project_is_approved'] == 0)]
print("Checking whether numbers in resource summary will be useful for project approval?");
equalsBorder(70);
print("Number of approved projects with numbers in resource summary: ", numericResourceApprovedData.shape[0]);
print("Number of rejected projects with numbers in resource summary: ", numericResourceRejectedData.shape[0]);
print("Number of approved projects without numbers in resource summary: ", textResourceApprovedData.shape[0]);
print("Number of rejected projects without numbers in resource summary: ", textResourceRejectedData.shape[0]);
Checking whether numbers in resource summary will be useful for project approval?
======================================================================
Number of approved projects with numbers in resource summary:  14090
Number of rejected projects with numbers in resource summary:  1666
Number of approved projects without numbers in resource summary:  78616
Number of rejected projects without numbers in resource summary:  14876

Observation:

  1. The rejection rate of project is less when projects resource summary has numbers in it.
  2. Even the number of projects approved without numbers in resource summary is high which means that the classification does not actually depends on whether resource summary contains numerical digits or not.

Conclusion of univariate analysis:

  1. There is huge overlap of approved and rejected projects when taken for all single featurs. So, this project cannot be classified using single features.
  2. project_title is some what better in text type of feature because of less overlap than others.
  3. The project approval is not depending on resources cost, but the probability of project rejection is more when resources cost is more.

Preprocessing data

In [0]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# All stopwords that are needed to be removed in the text
stopWords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]);
def preProcessingWithAndWithoutStopWords(texts):
    """
    This function takes list of texts and returns preprocessed list of texts one with
    stop words and one without stopwords.
    """
    # Variable for storing preprocessed text with stop words
    preProcessedTextsWithStopWords = [];
    # Variable for storing preprocessed text without stop words
    preProcessedTextsWithoutStopWords = [];
    
    # Looping over list of texts for performing pre processing
    for text in tqdm(texts, total = len(texts)):
        # Removing all links in the text
        text = re.sub(r"http\S+", "", text);

        # Removing all html tags in the text
        text = re.sub(r"<\w+/>", "", text);
        text = re.sub(r"<\w+>", "", text);
        
        # https://stackoverflow.com/a/47091490/4084039
        # Replacing all below words with adverbs
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"n\'t", " not", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'m", " am", text)
        
        # Removing backslash symbols in text
        text = text.replace('\\r', ' ');
        text = text.replace('\\n', ' ');
        text = text.replace('\\"', ' ');
        
        # Removing all special characters of text
        text = re.sub(r"[^a-zA-Z0-9]+", " ", text);
        
        # Converting whole review text into lower case
        text = text.lower();
        
        # adding this preprocessed text without stopwords to list
        preProcessedTextsWithStopWords.append(text);
        
        # removing stop words from text
        textWithoutStopWords = ' '.join([word for word in text.split() if word not in stopWords]);
        # adding this preprocessed text without stopwords to list
        preProcessedTextsWithoutStopWords.append(textWithoutStopWords);

    return [preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords];
In [58]:
texts = [projectsData['project_essay'].values[0]]
preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords = preProcessingWithAndWithoutStopWords(texts);
print("Example project essay without pre-processing: ");
equalsBorder(70);
print(texts);
equalsBorder(70);
print("Example project essay with stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithStopWords);
equalsBorder(70);
print("Example project essay without stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithoutStopWords);
Example project essay without pre-processing: 
======================================================================
['My students are English learners that are working on English as their second or third languages. We are a melting pot of refugees, immigrants, and native-born Americans bringing the gift of language to our school. \\r\\n\\r\\n We have over 24 languages represented in our English Learner program with students at every level of mastery.  We also have over 40 countries represented with the families within our school.  Each student brings a wealth of knowledge and experiences to us that open our eyes to new cultures, beliefs, and respect.\\"The limits of your language are the limits of your world.\\"-Ludwig Wittgenstein  Our English learner\'s have a strong support system at home that begs for more resources.  Many times our parents are learning to read and speak English along side of their children.  Sometimes this creates barriers for parents to be able to help their child learn phonetics, letter recognition, and other reading skills.\\r\\n\\r\\nBy providing these dvd\'s and players, students are able to continue their mastery of the English language even if no one at home is able to assist.  All families with students within the Level 1 proficiency status, will be a offered to be a part of this program.  These educational videos will be specially chosen by the English Learner Teacher and will be sent home regularly to watch.  The videos are to help the child develop early reading skills.\\r\\n\\r\\nParents that do not have access to a dvd player will have the opportunity to check out a dvd player to use for the year.  The plan is to use these videos and educational dvd\'s for the years to come for other EL students.\\r\\nnannan']
======================================================================
Example project essay with stop words and pre-processing: 
======================================================================
['my students are english learners that are working on english as their second or third languages we are a melting pot of refugees immigrants and native born americans bringing the gift of language to our school we have over 24 languages represented in our english learner program with students at every level of mastery we also have over 40 countries represented with the families within our school each student brings a wealth of knowledge and experiences to us that open our eyes to new cultures beliefs and respect the limits of your language are the limits of your world ludwig wittgenstein our english learner is have a strong support system at home that begs for more resources many times our parents are learning to read and speak english along side of their children sometimes this creates barriers for parents to be able to help their child learn phonetics letter recognition and other reading skills by providing these dvd is and players students are able to continue their mastery of the english language even if no one at home is able to assist all families with students within the level 1 proficiency status will be a offered to be a part of this program these educational videos will be specially chosen by the english learner teacher and will be sent home regularly to watch the videos are to help the child develop early reading skills parents that do not have access to a dvd player will have the opportunity to check out a dvd player to use for the year the plan is to use these videos and educational dvd is for the years to come for other el students nannan']
======================================================================
Example project essay without stop words and pre-processing: 
======================================================================
['students english learners working english second third languages melting pot refugees immigrants native born americans bringing gift language school 24 languages represented english learner program students every level mastery also 40 countries represented families within school student brings wealth knowledge experiences us open eyes new cultures beliefs respect limits language limits world ludwig wittgenstein english learner strong support system home begs resources many times parents learning read speak english along side children sometimes creates barriers parents able help child learn phonetics letter recognition reading skills providing dvd players students able continue mastery english language even no one home able assist families students within level 1 proficiency status offered part program educational videos specially chosen english learner teacher sent home regularly watch videos help child develop early reading skills parents not access dvd player opportunity check dvd player use year plan use videos educational dvd years come el students nannan']
In [59]:
projectEssays = projectsData['project_essay'];
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(projectEssays);

In [60]:
preProcessedEssaysWithoutStopWords[0:3]
Out[60]:
['students english learners working english second third languages melting pot refugees immigrants native born americans bringing gift language school 24 languages represented english learner program students every level mastery also 40 countries represented families within school student brings wealth knowledge experiences us open eyes new cultures beliefs respect limits language limits world ludwig wittgenstein english learner strong support system home begs resources many times parents learning read speak english along side children sometimes creates barriers parents able help child learn phonetics letter recognition reading skills providing dvd players students able continue mastery english language even no one home able assist families students within level 1 proficiency status offered part program educational videos specially chosen english learner teacher sent home regularly watch videos help child develop early reading skills parents not access dvd player opportunity check dvd player use year plan use videos educational dvd years come el students nannan',
 'students arrive school eager learn polite generous strive best know education succeed life help improve lives school focuses families low incomes tries give student education deserve not much students use materials given best projector need school crucial academic improvement students technology continues grow many resources internet teachers use growth students however school limited resources particularly technology without disadvantage one things could really help classrooms projector projector not crucial instruction also growth students projector show presentations documentaries photos historical land sites math problems much projector make teaching learning easier also targeting different types learners classrooms auditory visual kinesthetic etc nannan',
 'true champions not always ones win guts mia hamm quote best describes students cholla middle school approach playing sports especially girls boys soccer teams teams made 7th 8th grade students not opportunity play organized sport due family financial difficulties teach title one middle school urban neighborhood 74 students qualify free reduced lunch many come activity sport opportunity poor homes students love participate sports learn new skills apart team atmosphere school lacks funding meet students needs concerned lack exposure not prepare participating sports teams high school end school year goal provide students opportunity learn variety soccer skills positive qualities person actively participates team students campus come school knowing face uphill battle comes participating organized sports players would thrive field confidence appropriate soccer equipment play soccer best abilities students experience helpful person part team teaches positive supportive encouraging others students using soccer equipment practice games daily basis learn practice necessary skills develop strong soccer team experience create opportunity students learn part team positive contribution teammates students get opportunity learn practice variety soccer skills use skills game access type experience nearly impossible without soccer equipment students players utilize practice games nannan']
In [61]:
projectTitles = projectsData['project_title'];
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(projectTitles);
preProcessedProjectTitlesWithoutStopWords[0:5]

Out[61]:
['educational support english learners home',
 'wanted projector hungry learners',
 'soccer equipment awesome middle school students',
 'techie kindergarteners',
 'interactive math tools']
In [62]:
projectsData['preprocessed_titles'] = preProcessedProjectTitlesWithoutStopWords;
projectsData['preprocessed_essays'] = preProcessedEssaysWithoutStopWords;
projectsData.shape
Out[62]:
(109248, 24)

Preparing data for classification and modelling

In [63]:
pd.DataFrame(projectsData.columns, columns = ['All features in projects data'])
Out[63]:
All features in projects data
0 Unnamed: 0
1 id
2 teacher_id
3 teacher_prefix
4 school_state
5 project_submitted_datetime
6 project_grade_category
7 project_subject_categories
8 project_subject_subcategories
9 project_title
10 project_essay_1
11 project_essay_2
12 project_essay_3
13 project_essay_4
14 project_resource_summary
15 teacher_number_of_previously_posted_projects
16 project_is_approved
17 cleaned_categories
18 cleaned_sub_categories
19 project_essay
20 price
21 quantity
22 preprocessed_titles
23 preprocessed_essays

Useful features:

Here we will consider only below features for classification and we can ignore the other features

Categorical data:
  1. school_state - categorical data
  2. project_grade_category - categorical data
  3. cleaned_categories - categorical data
  4. cleaned_sub_categories - categorical data
  5. teacher_prefix - categorical data
Text data:
  1. project_resource_summary - text data
  2. project_title - text data
  3. project_resource_summary - text data
Numerical data:
  1. teacher_number_of_previously_posted_projects - numerical data
  2. price - numerical data
  3. quantity - numerical data

Vectorizing categorical data

1. Vectorizing cleaned_categories(project_subject_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(projectsData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(projectsData['cleaned_categories'].values);
In [0]:
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
Features used in vectorizing categories: 
======================================================================
['Warmth', 'Care_Hunger', 'History_Civics', 'Music_Arts', 'AppliedLearning', 'SpecialNeeds', 'Health_Sports', 'Math_Science', 'Literacy_Language']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (109248, 9)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 8)	1
  (1, 2)	1
  (1, 6)	1
  (2, 6)	1
  (3, 7)	1
  (3, 8)	1

2. Vectorizing cleaned_sub_categories(project_subject_sub_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(projectsData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(projectsData['cleaned_sub_categories'].values);
In [0]:
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
Features used in vectorizing subject sub categories: 
======================================================================
['Economics', 'CommunityService', 'FinancialLiteracy', 'ParentInvolvement', 'Extracurricular', 'Civics_Government', 'ForeignLanguages', 'NutritionEducation', 'Warmth', 'Care_Hunger', 'SocialSciences', 'PerformingArts', 'CharacterEducation', 'TeamSports', 'Other', 'College_CareerPrep', 'Music', 'History_Geography', 'Health_LifeScience', 'EarlyDevelopment', 'ESL', 'Gym_Fitness', 'EnvironmentalScience', 'VisualArts', 'Health_Wellness', 'AppliedSciences', 'SpecialNeeds', 'Literature_Writing', 'Mathematics', 'Literacy']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (109248, 30)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 20)	1
  (0, 29)	1
  (1, 5)	1
  (1, 13)	1
  (2, 13)	1
  (2, 24)	1
  (3, 28)	1
  (3, 29)	1

3. Vectorizing teacher_prefix - One Hot Encoding

In [0]:
def giveCounter(data):
    counter = Counter();
    for dataValue in data:
        counter.update(str(dataValue).split());
    return counter
In [0]:
giveCounter(projectsData['teacher_prefix'].values)
Out[0]:
Counter({'Mrs.': 57269,
         'Mr.': 10648,
         'Ms.': 38955,
         'Teacher': 2360,
         'nan': 3,
         'Dr.': 13})
In [0]:
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
Out[0]:
(109245, 22)
In [0]:
teacherPrefixDictionary = dict(giveCounter(projectsData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(projectsData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(projectsData['teacher_prefix'].values);
In [0]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Mrs.', 'Mr.', 'Ms.', 'Teacher', 'Dr.']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (109245, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (27, 3)	1
  (75, 3)	1
  (82, 3)	1
  (88, 3)	1
In [0]:
teacherPrefixes = [prefix.replace('.', '') for prefix in projectsData['teacher_prefix'].values];
teacherPrefixes[0:5]
Out[0]:
['Mrs', 'Mr', 'Ms', 'Mrs', 'Mrs']
In [0]:
projectsData['teacher_prefix'] = teacherPrefixes;
projectsData.head(3)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title ... project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home ... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work... 154.60 23
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners ... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea... 299.00 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... ... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th... 516.85 22

3 rows × 22 columns

In [0]:
teacherPrefixDictionary = dict(giveCounter(projectsData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(projectsData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(projectsData['teacher_prefix'].values);
In [0]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Mrs', 'Mr', 'Ms', 'Teacher', 'Dr']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (109245, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 0)	1

4. Vectorizing school_state - One Hot Encoding

In [0]:
schoolStateDictionary = dict(giveCounter(projectsData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(projectsData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(projectsData['school_state'].values);
In [0]:
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
Features used in vectorizing school_state: 
======================================================================
['IN', 'FL', 'AZ', 'KY', 'TX', 'CT', 'GA', 'SC', 'NC', 'CA', 'NY', 'OK', 'MA', 'NV', 'OH', 'PA', 'AL', 'LA', 'VA', 'AR', 'WA', 'WV', 'ID', 'TN', 'MS', 'CO', 'UT', 'IL', 'MI', 'HI', 'IA', 'RI', 'NJ', 'MO', 'DE', 'MN', 'ME', 'WY', 'ND', 'OR', 'AK', 'MD', 'WI', 'SD', 'NE', 'NM', 'DC', 'KS', 'MT', 'NH', 'VT']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (109245, 51)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 3)	1

5. Vectorizing project_grade_category - One Hot Encoding

In [0]:
giveCounter(projectsData['project_grade_category'])
Out[0]:
Counter({'Grades': 109245,
         'PreK-2': 44225,
         '6-8': 16923,
         '3-5': 37135,
         '9-12': 10962})
In [0]:
cleanedGrades = []
for grade in projectsData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
Out[0]:
['GradesPreKto2', 'Grades6to8', 'Grades6to8', 'GradesPreKto2']
In [0]:
projectsData['project_grade_category'] = cleanedGrades
projectsData.head(4)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title ... project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs IN 2016-12-05 13:43:57 GradesPreKto2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home ... NaN NaN My students need opportunities to practice beg... 0 0 Literacy_Language ESL Literacy My students are English learners that are work... 154.60 23
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr FL 2016-10-25 09:22:10 Grades6to8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners ... NaN NaN My students need a projector to help with view... 7 1 History_Civics Health_Sports Civics_Government TeamSports Our students arrive to our school eager to lea... 299.00 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms AZ 2016-08-31 12:03:56 Grades6to8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... ... NaN NaN My students need shine guards, athletic socks,... 1 0 Health_Sports Health_Wellness TeamSports \r\n\"True champions aren't always the ones th... 516.85 22
3 45 p246581 f3cb9bffbba169bef1a77b243e620b60 Mrs KY 2016-10-06 21:16:17 GradesPreKto2 Literacy & Language, Math & Science Literacy, Mathematics Techie Kindergarteners ... NaN NaN My students need to engage in Reading and Math... 4 1 Literacy_Language Math_Science Literacy Mathematics I work at a unique school filled with both ESL... 232.90 4

4 rows × 22 columns

In [0]:
projectGradeDictionary = dict(giveCounter(projectsData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(projectsData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(projectsData['project_grade_category'].values);
In [0]:
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
Features used in vectorizing project_grade_category: 
======================================================================
['GradesPreKto2', 'Grades6to8', 'Grades3to5', 'Grades9to12']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (109245, 4)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 1)	1
  (3, 0)	1
In [0]:
projectsDataSub = projectsData[0:40000];
preProcessedEssaysWithoutStopWordsSub = preProcessedEssaysWithoutStopWords[0:40000];
preProcessedProjectTitlesWithoutStopWordsSub = preProcessedProjectTitlesWithoutStopWords[0:40000];

Vectorizing Text Data

Bag of Words

1. Vectorizing project_essay

In [0]:
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWordsSub);
In [0]:
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
Some of the Features used in vectorizing preprocessed essays: 
======================================================================
['yeats', 'yell', 'yelling', 'yellow', 'yemen', 'yes', 'yesterday', 'yet', 'yield', 'yields', 'yoga', 'york', 'younannan', 'young', 'younger', 'youngest', 'youngsters', 'youth', 'youthful', 'youths', 'youtube', 'yummy', 'zeal', 'zearn', 'zen', 'zenergy', 'zero', 'zest', 'zip', 'ziploc', 'zippers', 'zipping', 'zone', 'zoned', 'zones', 'zoo', 'zoom', 'zooming', 'zoos', 'zumba']
======================================================================
Shape of preprocessed essay matrix after vectorization:  (40000, 11077)
======================================================================
Sample bag-of-words vector of preprocessed essay: 
======================================================================
  (0, 6533)	1
  (0, 3306)	1
  (0, 1981)	1
  (0, 11036)	1
  (0, 7347)	1
  (0, 11029)	1
  (0, 10530)	2
  (0, 1734)	1
  (0, 6855)	1
  (0, 7374)	2
  (0, 232)	1
  (0, 6687)	1
  (0, 3211)	1
  (0, 2805)	1
  (0, 10766)	1
  (0, 8133)	1
  (0, 8803)	1
  (0, 9831)	1
  (0, 1794)	1
  (0, 9237)	1
  (0, 10639)	3
  (0, 3274)	2
  (0, 7068)	1
  (0, 6798)	1
  (0, 9399)	1
  :	:
  (0, 6123)	2
  (0, 5785)	2
  (0, 3613)	1
  (0, 7703)	2
  (0, 5732)	3
  (0, 8269)	2
  (0, 67)	1
  (0, 8670)	2
  (0, 5664)	3
  (0, 4383)	1
  (0, 1339)	1
  (0, 553)	1
  (0, 1248)	1
  (0, 6549)	1
  (0, 5003)	1
  (0, 8116)	1
  (0, 7501)	1
  (0, 6207)	1
  (0, 5665)	2
  (0, 9968)	1
  (0, 8736)	1
  (0, 10964)	1
  (0, 5733)	1
  (0, 3449)	7
  (0, 9553)	5

2. Vectorizing project_title

In [0]:
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWordsSub);
In [0]:
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
Some of the Features used in vectorizing preprocessed titles: 
======================================================================
['wireless', 'wise', 'wish', 'within', 'without', 'wizards', 'wo', 'wobble', 'wobbles', 'wobbling', 'wobbly', 'wonder', 'wonderful', 'wonders', 'word', 'words', 'work', 'workers', 'working', 'works', 'workshop', 'world', 'worlds', 'worms', 'worth', 'would', 'wow', 'write', 'writer', 'writers', 'writing', 'ye', 'year', 'yearbook', 'yes', 'yoga', 'young', 'youth', 'zone', 'zoom']
======================================================================
Shape of preprocessed title matrix after vectorization:  (40000, 1774)
======================================================================
Sample bag-of-words vector of preprocessed title: 
======================================================================
  (0, 766)	1
  (0, 906)	1
  (0, 514)	1
  (0, 1553)	1
  (0, 483)	1

Tf-Idf Vectorization

1. Vectorizing project_essay

In [0]:
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWordsSub);
In [0]:
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
Some of the Features used in tf-idf vectorizing preprocessed essays: 
======================================================================
['yeats', 'yell', 'yelling', 'yellow', 'yemen', 'yes', 'yesterday', 'yet', 'yield', 'yields', 'yoga', 'york', 'younannan', 'young', 'younger', 'youngest', 'youngsters', 'youth', 'youthful', 'youths', 'youtube', 'yummy', 'zeal', 'zearn', 'zen', 'zenergy', 'zero', 'zest', 'zip', 'ziploc', 'zippers', 'zipping', 'zone', 'zoned', 'zones', 'zoo', 'zoom', 'zooming', 'zoos', 'zumba']
======================================================================
Shape of preprocessed title matrix after tf-idf vectorization:  (40000, 11077)
======================================================================
Sample Tf-Idf vector of preprocessed essay: 
======================================================================
  (0, 9553)	0.07732161197654648
  (0, 3449)	0.2978137199079083
  (0, 5733)	0.03611311825070974
  (0, 10964)	0.03819325396356506
  (0, 8736)	0.04966730436190034
  (0, 9968)	0.05933894161734909
  (0, 5665)	0.13189136979245247
  (0, 6207)	0.09909858268088724
  (0, 7501)	0.09797369103397546
  (0, 8116)	0.09716121418147701
  (0, 5003)	0.09174889764250635
  (0, 6549)	0.07739523816315956
  (0, 1248)	0.09041771504928811
  (0, 553)	0.09502243963232913
  (0, 1339)	0.07922532406820633
  (0, 4383)	0.08387324724715874
  (0, 5664)	0.12052414724469786
  (0, 8670)	0.03565737676523101
  (0, 67)	0.0797508795755641
  (0, 8269)	0.18440093271700464
  (0, 5732)	0.23244852084297085
  (0, 7703)	0.0932371184396508
  (0, 3613)	0.033250154942777416
  (0, 5785)	0.08336998078832462
  (0, 6123)	0.18451571587493337
  :	:
  (0, 9399)	0.0680639151319745
  (0, 6798)	0.08632328546640713
  (0, 7068)	0.046135007257522224
  (0, 3274)	0.10489683635458984
  (0, 10639)	0.2063461965343629
  (0, 9237)	0.1100116652395096
  (0, 1794)	0.07900547931629058
  (0, 9831)	0.03792376194008962
  (0, 8803)	0.09740047454864696
  (0, 8133)	0.09001501053091984
  (0, 10766)	0.07024528926492071
  (0, 2805)	0.05089165427462248
  (0, 3211)	0.06222851802675729
  (0, 6687)	0.022226920710368445
  (0, 232)	0.040248356980164615
  (0, 7374)	0.1846309297399045
  (0, 6855)	0.03799907965204156
  (0, 1734)	0.07743897673831124
  (0, 10530)	0.05491069896079749
  (0, 11029)	0.030886589234837624
  (0, 7347)	0.06268239285732621
  (0, 11036)	0.04610937510882687
  (0, 1981)	0.02654012905964554
  (0, 3306)	0.1031894334469226
  (0, 6533)	0.016043824658976313

2. Vectorizing project_title

In [0]:
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWordsSub);
In [0]:
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
Some of the Features used in tf-idf vectorizing preprocessed titles: 
======================================================================
['wireless', 'wise', 'wish', 'within', 'without', 'wizards', 'wo', 'wobble', 'wobbles', 'wobbling', 'wobbly', 'wonder', 'wonderful', 'wonders', 'word', 'words', 'work', 'workers', 'working', 'works', 'workshop', 'world', 'worlds', 'worms', 'worth', 'would', 'wow', 'write', 'writer', 'writers', 'writing', 'ye', 'year', 'yearbook', 'yes', 'yoga', 'young', 'youth', 'zone', 'zoom']
======================================================================
Shape of preprocessed title matrix after tf-idf vectorization:  (40000, 1774)
======================================================================
Sample Tf-Idf vector of preprocessed title: 
======================================================================
  (0, 483)	0.5356140846908081
  (0, 1553)	0.4441059196924978
  (0, 514)	0.4615835742389133
  (0, 906)	0.3400969810242112
  (0, 766)	0.4326223894644794

Average Word2Vector Vectorization

In [0]:
# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# We should have glove_vectors file for creating below model
with open('glove_vectors', 'rb') as f:
    gloveModel = pickle.load(f)
    gloveWords =  set(gloveModel.keys())
In [0]:
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
Glove vector of sample word: 
======================================================================
[-0.26078   -0.36898   -0.022831   0.21666    0.16672   -0.20268
 -3.1219     0.33057    0.71512    0.28874    0.074368  -0.033203
  0.23783    0.21052    0.076562   0.13007   -0.31706   -0.45888
 -0.45463   -0.13191    0.49761    0.072704   0.16811    0.18846
 -0.16688   -0.21973    0.08575   -0.19577   -0.2101    -0.32436
 -0.56336    0.077996  -0.22758   -0.66569    0.14824    0.038945
  0.50881   -0.1352     0.49966   -0.4401    -0.022335  -0.22744
  0.22086    0.21865    0.36647    0.30495   -0.16565    0.038759
  0.28108   -0.2167     0.12453    0.65401    0.34584   -0.2557
 -0.046363  -0.31111   -0.020936  -0.17122   -0.77114    0.29289
 -0.14625    0.39541   -0.078938   0.051127   0.15076    0.085126
  0.183     -0.06755    0.26312    0.0087276  0.0066415  0.37033
  0.03496   -0.12627   -0.052626  -0.34897    0.14672    0.14799
 -0.21821   -0.042785   0.2661    -1.1105     0.31789    0.27278
  0.054468  -0.27458    0.42732   -0.44101   -0.19302   -0.32948
  0.61501   -0.22301   -0.36354   -0.34983   -0.16125   -0.17195
 -3.363      0.45146   -0.13753    0.31107    0.2061     0.33063
  0.45879    0.24256    0.042342   0.074837  -0.12869    0.12066
  0.42843   -0.4704    -0.18937    0.32685    0.26079    0.20518
 -0.18432   -0.47658    0.69193    0.18731   -0.12516    0.35447
 -0.1969    -0.58981   -0.88914    0.5176     0.13177   -0.078557
  0.032963  -0.19411    0.15109    0.10547   -0.1113    -0.61533
  0.0948    -0.3393    -0.20071   -0.30197    0.29531    0.28017
  0.16049    0.25294   -0.44266   -0.39412    0.13486    0.25178
 -0.044114   1.1519     0.32234   -0.34323   -0.10713   -0.15616
  0.031206   0.46636   -0.52761   -0.39296   -0.068424  -0.04072
  0.41508   -0.34564    0.71001   -0.364      0.2996     0.032281
  0.34035    0.23452    0.78342    0.48045   -0.1609     0.40102
 -0.071795  -0.16531    0.082153   0.52065    0.24194    0.17113
  0.33552   -0.15725   -0.38984    0.59337   -0.19388   -0.39864
 -0.47901    1.0835     0.24473    0.41309    0.64952    0.46846
  0.024386  -0.72087   -0.095061   0.10095   -0.025229   0.29435
 -0.57696    0.53166   -0.0058338 -0.3304     0.19661   -0.085206
  0.34225    0.56262    0.19924   -0.027111  -0.44567    0.17266
  0.20887   -0.40702    0.63954    0.50708   -0.31862   -0.39602
 -0.1714    -0.040006  -0.45077   -0.32482   -0.0316     0.54908
 -0.1121     0.12951   -0.33577   -0.52768   -0.44592   -0.45388
  0.66145    0.33023   -1.9089     0.5318     0.21626   -0.13152
  0.48258    0.68028   -0.84115   -0.51165    0.40017    0.17233
 -0.033749   0.045275   0.37398   -0.18252    0.19877    0.1511
  0.029803   0.16657   -0.12987   -0.50489    0.55311   -0.22504
  0.13085   -0.78459    0.36481   -0.27472    0.031805   0.53052
 -0.20078    0.46392   -0.63554    0.040289  -0.19142   -0.0097011
  0.068084  -0.10602    0.25567    0.096125  -0.10046    0.15016
 -0.26733   -0.26494    0.057888   0.062678  -0.11596    0.28115
  0.25375   -0.17954    0.20615    0.24189    0.062696   0.27719
 -0.42601   -0.28619   -0.44697   -0.082253  -0.73415   -0.20675
 -0.60289   -0.06728    0.15666   -0.042614   0.41368   -0.17367
 -0.54012    0.23883    0.23075    0.13608   -0.058634  -0.089705
  0.18469    0.023634   0.16178    0.23384    0.24267    0.091846 ]
======================================================================
Shape of glove vector:  (300,)
In [0]:
def getWord2VecVectors(texts):
    word2VecTextsVectors = [];
    for preProcessedText in tqdm(texts):
        word2VecTextVector = np.zeros(300);
        numberOfWordsInText = 0;
        for word in preProcessedText.split():
            if word in gloveWords:
                word2VecTextVector += gloveModel[word];
                numberOfWordsInText += 1;
        if numberOfWordsInText != 0:
            word2VecTextVector = word2VecTextVector / numberOfWordsInText;
        word2VecTextsVectors.append(word2VecTextVector);
    return word2VecTextsVectors;

1. Vectorizing project_essay

In [0]:
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);
In [0]:
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
Shape of Word2Vec vectorization matrix of essays: 109248,300
======================================================================
Sample essay: 
======================================================================
students english learners working english second third languages melting pot refugees immigrants native born americans bringing gift language school 24 languages represented english learner program students every level mastery also 40 countries represented families within school student brings wealth knowledge experiences us open eyes new cultures beliefs respect limits language limits world ludwig wittgenstein english learner strong support system home begs resources many times parents learning read speak english along side children sometimes creates barriers parents able help child learn phonetics letter recognition reading skills providing dvd players students able continue mastery english language even no one home able assist families students within level 1 proficiency status offered part program educational videos specially chosen english learner teacher sent home regularly watch videos help child develop early reading skills parents not access dvd player opportunity check dvd player use year plan use videos educational dvd years come el students nannan
======================================================================
Word2Vec vector of sample essay: 
======================================================================
[-1.40030644e-02  8.78995685e-02  3.50108161e-02 -5.90358980e-03
 -5.93166809e-02 -6.21039893e-02 -2.96711248e+00  9.45840302e-02
 -8.18737785e-03  4.46964161e-02 -7.64722101e-02  6.97099444e-02
  8.44441262e-02 -1.22974138e-01 -3.55310208e-02 -8.90947154e-02
  1.20959579e-01 -1.21977699e-01  4.61334597e-02 -3.33640832e-02
  1.24900557e-01  7.18837631e-02 -6.14885114e-02 -2.67269047e-02
  6.82086621e-02 -3.60263034e-02  1.17172255e-01 -1.17868631e-01
 -1.13467710e-01 -9.25920168e-02 -2.42461725e-01 -7.92963658e-02
  3.52513154e-03  1.79752468e-01 -4.69217812e-02 -3.56593007e-02
 -7.95331477e-03 -6.71107383e-04 -1.80828067e-02 -1.16224805e-02
 -3.69645852e-02  1.61287176e-01 -1.75201329e-01 -6.02256376e-02
  1.48811886e-02 -9.00106181e-02  7.72160490e-02  7.42989819e-02
 -1.02682389e-02 -1.33311658e-01 -2.82030537e-02 -7.71051879e-03
  7.33988450e-02  3.54095087e-02 -5.80719597e-03 -8.70242758e-02
 -3.57117638e-02  2.78475651e-02 -1.54957291e-01 -3.24157495e-02
 -5.93266570e-02 -8.80254174e-02  2.18914318e-01 -1.22730395e-02
 -1.05831485e-01  1.53985730e-01  7.15618933e-02 -3.97147470e-02
  1.47169116e-01 -4.50476644e-03 -1.49678829e-01  5.52201396e-02
  3.04915879e-02 -6.24086617e-02 -7.68483134e-02 -7.50149195e-02
 -1.07105068e-01 -2.69954530e-02  1.28067340e-01 -3.42946330e-02
  4.24139667e-02 -4.49685043e-01  1.52793905e-01 -9.06178181e-02
 -6.67951510e-02 -2.72063766e-02  7.37261792e-02 -8.64977130e-02
  1.64616877e-01  4.86745523e-02 -4.44542828e-02 -3.04823530e-02
  2.63897436e-02 -6.59345034e-02 -5.21813664e-02 -7.45015886e-02
 -2.21975948e+00  8.57858456e-02  7.73778584e-02  1.14644799e-01
 -1.50536483e-01 -5.17326940e-02  3.23826117e-02 -1.15700542e-01
  7.15651973e-02  9.15412617e-02  5.41334631e-02 -1.25451318e-01
  2.80941483e-02 -3.95890262e-02 -1.67010497e-02  1.74708879e-02
  4.58374505e-02  2.56664910e-01  3.74891134e-02  3.00990497e-02
 -2.18904765e-01  9.37672966e-02  9.99403436e-02  5.26255996e-02
 -6.67958718e-02  5.97650946e-02  4.14311192e-02 -6.85917603e-02
  1.72453235e-02  1.02485026e-01  3.02940430e-02  9.59998859e-03
  1.96364913e-02  1.22438477e-01  7.98410557e-02  1.92611322e-02
  6.44085906e-03  4.94252148e-03 -5.36137718e-03 -1.17976934e-01
  1.77991634e-01 -2.51954819e-02  8.02478188e-02  2.29125079e-01
  3.79080403e-02  1.22892819e-02  7.19621470e-02 -9.25031570e-02
 -8.86571674e-02 -4.74898563e-02  1.68688409e-02 -1.15134901e-01
  1.76528904e-01 -6.30485141e-02 -4.99678329e-02 -1.00350507e-01
  1.25089302e-02 -4.08706114e-02  4.50565289e-02  2.49286074e-02
 -1.29713758e-03 -3.21404376e-02 -2.52972249e-02 -9.63531510e-02
  8.42448993e-04 -7.29482953e-03 -3.77497893e-02 -9.35034987e-02
 -3.45719793e-02  7.15921796e-02 -1.29330935e-01  1.28508101e-02
  4.24846988e-02 -8.43078228e-02  4.79772134e-02 -3.05753799e-02
 -3.03772013e-02 -2.10572558e-01 -1.05464289e-03  5.18230436e-02
 -4.39921874e-02  5.29591584e-02 -1.08551689e-01  2.88053128e-02
 -4.88957058e-02  2.31962381e-01 -2.90986193e-02 -2.83725755e-02
 -6.80350899e-02 -6.99966387e-02 -6.80414679e-02 -7.63552362e-02
 -1.59287859e-02 -2.59947651e-03 -7.81848121e-03 -1.14299579e-01
 -2.02054698e-02  1.21184430e-03  2.59984919e-02 -7.64172013e-02
  9.47882617e-03 -5.71751181e-02  1.25667972e-01 -4.60388139e-02
  5.51296403e-02 -6.73280980e-02 -2.06862389e-02  1.12049165e-01
 -7.63451436e-02  4.71124027e-02  6.32404235e-02 -2.13828034e-02
  1.24239236e-01  5.08985235e-02  2.05136711e-03  1.45916498e-02
  4.25123886e-02 -9.41766832e-02 -3.08569389e-02 -2.57995470e-02
 -3.53808765e-02 -7.16000389e-02  1.35426121e-02  4.57596799e-02
 -1.85721693e-01 -6.62042523e-02 -1.45448285e-01  5.50366758e-02
 -2.09367026e+00  1.23479489e-01 -1.46630889e-01 -8.86940765e-02
 -7.32806463e-02 -1.48629733e-01  3.23867248e-03  7.08553181e-02
  1.10315906e-02 -2.35431879e-02 -7.69633283e-02 -1.13640894e-01
  9.96301846e-02 -5.70585054e-02 -5.45997987e-04  9.42995174e-02
 -1.40422433e-01 -5.03571812e-04 -2.50305216e-01  3.79384141e-02
 -6.44086637e-02 -1.53146188e-02 -2.55858274e-02 -1.10195376e-01
  1.62183899e-02 -1.61929591e-02  2.03421993e-02  1.21424534e-01
  5.02740463e-02  2.37900799e-02  9.07398322e-02  1.57962685e-02
  3.73036075e-02 -8.14876248e-02  1.37349395e-01 -8.17880913e-02
  9.27907812e-02  6.76093826e-03 -5.22928389e-02  6.02994188e-02
  8.28096711e-03 -1.05344042e-01 -1.02705751e-01  2.45275938e-02
 -1.18970611e-02  9.86759282e-02 -1.92870134e-02  9.71936577e-03
 -1.40249490e-01  1.61314103e-01 -4.55344879e-02  2.21929812e-02
  9.54108215e-02 -1.25028370e-02  2.89625007e-02  1.65818081e-02
 -2.34467852e-02 -7.88610081e-02  3.34242148e-03  4.43269879e-02
 -4.08419376e-02  6.06990416e-02  2.33916564e-02 -1.02773899e-02
  9.21596550e-02  9.90483805e-02  7.50525638e-03 -4.07725570e-03
 -6.93980047e-02 -3.50341946e-02 -8.79849597e-02 -4.10474223e-02
  4.55004698e-03  2.27073689e-01  1.37340472e-01  4.43856114e-02]

2. Vectorizing project_title

In [0]:
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);

In [0]:
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
Shape of Word2Vec vectorization matrix of project titles: 109248, 300
======================================================================
Sample title: 
======================================================================
educational support english learners home
======================================================================
Word2Vec vector of sample title: 
======================================================================
[-4.1285000e-02  4.4970000e-02  1.4283080e-01  1.9901860e-02
 -8.4519200e-02 -4.3207400e-01 -2.8496800e+00 -2.2953320e-01
  2.1736960e-01  3.4239600e-01 -7.5568200e-02  1.8077600e-01
  1.3998316e-01 -1.6401800e-01 -2.9812820e-01 -2.5030200e-01
  2.0420960e-01 -1.6882720e-01  6.5439800e-02 -1.6061000e-01
  2.2179020e-01  2.9944900e-01  2.7358000e-02 -8.8528800e-02
  1.5856400e-01  6.2905000e-02  2.0427440e-01 -1.9312560e-01
 -9.2904600e-02 -2.2050020e-01 -5.7761060e-01 -1.2101294e-01
  1.6846980e-01  2.8212460e-01 -1.8210120e-01  1.7754000e-02
  1.4805200e-01  4.1059000e-02  3.1145000e-02 -9.5658000e-02
 -9.6840000e-03  2.4896520e-01 -2.5047440e-01  7.7859000e-02
 -3.7512000e-03 -2.7071920e-01  2.5586200e-02  2.3205600e-01
  1.0154800e-01 -5.2259200e-01 -1.3211440e-01  1.1908300e-01
  2.7147196e-01  5.6135400e-02 -5.3140200e-02 -1.4937160e-01
 -1.0488160e-01  1.2059600e-01 -1.2639620e-01 -1.4316640e-01
 -2.2147600e-01 -1.9137800e-01  1.6595340e-01 -5.6078000e-02
  3.9884400e-02  1.0854760e-01  1.5552920e-01  7.8204600e-02
  9.5928000e-02 -6.2156000e-03 -1.1407312e-01  3.6862800e-02
 -8.7530020e-02 -4.7668000e-02 -2.3264200e-01 -6.1687200e-02
 -3.1690916e-01 -1.1851380e-01  1.4931240e-01 -7.7857200e-02
  1.8634840e-01 -4.6202100e-01  2.7096800e-01 -3.0512800e-02
 -2.1226400e-01 -1.5356200e-02  1.0844260e-01 -8.2669200e-02
  2.8918600e-01  1.3372960e-01 -8.3522800e-02  4.6474200e-02
  2.0703580e-01 -2.1937640e-01 -1.0252400e-01 -2.5177000e-01
 -2.8408000e+00  1.6622880e-01  1.1216234e-01  2.0837920e-01
 -1.5711600e-01 -1.9159400e-01 -1.4992160e-01 -2.7392820e-01
  3.4989140e-01  1.3991600e-01  1.6275200e-01  1.3887200e-01
  1.8212760e-01 -3.2218600e-02  4.3172000e-02  1.8323640e-01
  1.2295780e-01  4.4706600e-01  2.1688400e-02 -3.8988200e-02
 -3.2467400e-01  3.8389160e-01 -1.4416560e-01  1.1117380e-01
 -1.6218300e-01  1.3871928e-01  1.4305240e-01 -7.6173200e-02
  8.9476800e-02  2.6043820e-01  5.1114000e-02  1.0619800e-01
  1.5968840e-01  1.0530680e-01  8.6300000e-02  1.4667260e-01
  1.2320460e-02 -6.6124620e-02 -1.1017760e-01 -1.5091940e-01
  2.1297280e-01 -3.2808520e-01  1.4493194e-01  2.1848680e-01
 -4.1809800e-03  8.5340000e-02 -1.2410789e-01 -2.2308140e-01
  8.8026000e-02  1.9555000e-01 -3.7981400e-02 -1.7720080e-01
  3.4328600e-01 -3.7459600e-01 -1.7268200e-01 -2.1554400e-01
 -1.1533400e-01  9.9680000e-02 -1.9032980e-01  8.6249800e-02
  7.6682200e-02 -9.1090380e-02 -9.3714000e-02 -1.7333260e-01
  8.6429960e-02 -6.7933600e-02 -8.6470600e-02 -2.2431600e-01
 -2.8319800e-01  1.0138200e-01 -2.8114320e-01 -1.1168240e-01
  2.1770560e-02 -1.3971160e-01  2.1795080e-01 -1.1995600e-01
 -1.3166600e-02 -3.4848260e-01 -3.0102000e-02  2.3396200e-02
  2.8840000e-02  2.8763000e-01 -2.3679600e-02  1.1806440e-01
 -3.2261460e-01  2.2622920e-01  1.9506400e-02  1.4363200e-01
 -1.3668380e-01 -1.0521880e-01 -3.9385400e-03 -4.6388000e-02
 -7.7493780e-02 -2.4700800e-02 -5.2006200e-02 -2.6299360e-01
 -2.5607520e-01  2.1704520e-01  5.6336000e-02 -6.3474400e-02
 -1.0400400e-01 -1.7901000e-01  2.0326180e-01 -2.8708740e-01
  1.0132000e-01 -1.6278080e-01  1.2441440e-01  3.2699820e-01
 -4.8321600e-02 -3.6052800e-02  2.2539620e-01 -8.2764000e-03
  3.1087258e-01  2.4090500e-01 -9.9590000e-02  1.2362460e-01
  1.7440000e-03 -1.6117280e-01  7.4570000e-02  3.1281120e-02
 -1.1758000e-02 -1.8464800e-02 -2.0872020e-01 -3.9510000e-03
 -5.7714400e-01 -1.8090080e-01 -2.8288200e-01 -2.4662120e-01
 -1.8806540e+00  4.4765400e-01 -2.9412700e-01 -1.7280000e-02
 -3.1931600e-01 -1.9190500e-01 -1.1642000e-02  1.7475600e-01
  1.3068840e-01  1.1943000e-01 -1.7219524e-01  1.9224000e-02
  2.2620000e-01 -1.0821980e-01  1.3789060e-01  2.6989320e-01
 -2.4364960e-01 -1.3650800e-01 -3.0984180e-01 -3.9546200e-02
 -1.1410800e-01 -6.6744640e-02  1.6330620e-01 -4.0601000e-01
  9.3793000e-02 -8.3026800e-02  9.0567600e-02  3.1595600e-01
  1.6786620e-01  1.0099860e-01  3.5043600e-02  6.6221200e-02
 -3.5907800e-02 -2.4589760e-01  2.6006800e-01 -8.0637000e-02
  1.5359624e-01 -1.1078680e-01 -5.6956400e-02  2.2253080e-01
  3.5808000e-02 -1.8873860e-01 -2.5032660e-01  3.6167400e-02
 -2.2424700e-01  2.7863640e-01  2.2622600e-02  1.3753300e-01
 -2.3369620e-01  2.8058040e-01  5.0818000e-02 -3.4805800e-02
  1.7916600e-01 -7.5374000e-02  7.1228900e-02  1.7556000e-01
 -5.8004120e-01 -2.0522500e-01 -1.3367960e-01  1.3656000e-02
 -2.9052200e-02  1.3698600e-02  1.1746340e-01 -2.3288400e-02
  2.7706200e-01  1.6106000e-01 -2.0183340e-01  5.7781800e-02
 -2.0954400e-01 -1.4111260e-02 -3.1186860e-01 -2.9536360e-02
 -1.7226500e-01  3.5709400e-01  2.9448200e-01  8.5600000e-05]

Tf-Idf Weighted Word2Vec Vectorization

1. Vectorizing project_essay

In [0]:
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed essays using tfidf vectorizer initialized above 
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
In [0]:
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
    # Sum of tf-idf values of all words in a particular essay
    cumulativeSumTfIdfWeightOfEssay = 0;
    # Tf-Idf weighted word2vec vector of a particular essay
    tfIdfWeightedWord2VecEssayVector = np.zeros(300);
    # Splitting essay into list of words
    splittedEssay = essay.split();
    # Iterating over each word
    for word in splittedEssay:
        # Checking if word is in glove words and set of words used by tfIdf essay vectorizer
        if (word in gloveWords) and (word in tfIdfEssayWords):
            # Tf-Idf value of particular word in essay
            tfIdfValueWord = tfIdfEssayDictionary[word] * (essay.count(word) / len(splittedEssay));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfEssay != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
    # Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
    tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);

In [0]:
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: 109248, 300
======================================================================
Sample Essay: 
======================================================================
students english learners working english second third languages melting pot refugees immigrants native born americans bringing gift language school 24 languages represented english learner program students every level mastery also 40 countries represented families within school student brings wealth knowledge experiences us open eyes new cultures beliefs respect limits language limits world ludwig wittgenstein english learner strong support system home begs resources many times parents learning read speak english along side children sometimes creates barriers parents able help child learn phonetics letter recognition reading skills providing dvd players students able continue mastery english language even no one home able assist families students within level 1 proficiency status offered part program educational videos specially chosen english learner teacher sent home regularly watch videos help child develop early reading skills parents not access dvd player opportunity check dvd player use year plan use videos educational dvd years come el students nannan
======================================================================
Tf-Idf Weighted Word2Vec vector of sample essay: 
======================================================================
[-5.37582850e-02  7.68689598e-02  7.85741822e-02  4.38958976e-02
 -8.56874440e-02 -1.20832331e-01 -2.68120986e+00  7.17018732e-02
  1.03799206e-04 -5.17255299e-03 -2.67529751e-02  7.40185988e-02
  1.36881934e-01 -8.62706493e-02 -6.35020145e-02 -8.44084597e-02
  1.27523921e-01 -1.77105602e-01  3.68451284e-02 -5.74471880e-02
  1.86477259e-01  9.28786009e-02 -9.73137896e-02 -1.15230456e-02
  4.41962185e-02 -9.32894883e-02  1.11912943e-01 -1.17540961e-01
 -1.22150893e-01 -9.14028838e-02 -1.73918944e-01 -4.54143189e-02
 -7.82036060e-02  3.05617633e-01 -8.71850266e-02  6.31466708e-03
  1.15683161e-01  1.71477594e-02 -5.52983597e-02  9.08989585e-02
 -3.89808292e-04  1.97696142e-01 -4.08078376e-01 -5.39990199e-02
 -1.20129600e-02 -1.12456389e-01  2.92046345e-02  1.37924729e-01
  2.83465620e-02 -2.26817169e-01 -2.29639267e-02  6.94257143e-03
  5.80535394e-02  2.86454339e-02 -7.51508216e-02 -6.21569354e-02
 -1.41805544e-01  2.78707358e-02 -1.63165999e-01 -1.29716251e-01
 -5.67625355e-02 -8.59507500e-02  3.54019902e-01 -4.96274469e-02
 -6.88414062e-02  1.58623510e-01  1.24798600e-01  4.29711440e-02
  7.82814323e-02 -1.73260116e-02 -1.23679491e-01  1.47617250e-01
  4.27083617e-02 -1.16531047e-01 -1.27122530e-01 -5.93638332e-03
 -1.99224414e-01 -8.66160391e-02  2.47701354e-01  1.61218205e-02
  3.56880345e-02 -3.71320273e-01  2.65501745e-01 -4.56454865e-02
 -7.85433814e-02 -5.99177835e-02  4.42212779e-02 -8.20739267e-02
  2.14031939e-01  2.42131497e-02 -1.34069697e-01  7.15871686e-03
  4.00667270e-02 -6.75881497e-02 -7.07967357e-02 -2.15984749e-02
 -2.09734597e+00  1.02300477e-01  6.61169899e-02  5.70146517e-02
 -1.91302495e-01 -1.38114014e-01 -1.10709961e-01 -1.66994098e-01
  9.17800823e-02  1.35327093e-01  2.20333244e-02 -3.83844831e-02
  2.57206511e-02 -5.54503565e-02 -3.41973653e-03  1.99777588e-02
  4.85050396e-02  2.13190534e-01  4.64281665e-02  6.51171751e-02
 -5.80015838e-02  1.19900386e-01  1.18803830e-01  7.05550873e-02
 -1.87330886e-01  1.41219129e-01  1.33569574e-01  1.00530000e-01
  4.14498415e-02  1.39860952e-01 -7.95709830e-02  9.70242332e-02
  1.07442882e-01  9.00794808e-02  7.47745032e-02  4.18772282e-02
 -7.10347826e-03 -7.62379756e-03 -7.31715828e-02 -1.16370646e-01
  2.82271708e-01 -5.30885621e-02  4.51472249e-02  2.61376253e-01
  1.29080066e-02  3.96843846e-02  1.04430681e-01 -1.30495811e-01
 -1.17999239e-01 -1.02810089e-01 -6.52713784e-02 -1.81350799e-01
  1.55415740e-01 -4.43517889e-02 -8.34350788e-02 -1.31445407e-01
 -8.87524029e-02 -1.15321245e-02  8.67587067e-03  3.55646447e-02
 -4.32365925e-02  2.44285859e-03  2.73165854e-02 -1.91651165e-01
  6.70942750e-03  1.45533103e-02 -5.95191056e-02 -9.78336553e-02
 -4.61200683e-02  1.04017495e-02 -1.68129330e-01 -5.53455289e-02
 -1.95353920e-02 -3.24088827e-03  9.94121739e-02 -2.20584067e-02
  1.36190091e-02 -3.13014669e-01  4.46748268e-02  6.11251996e-02
 -5.59088914e-02  8.07071841e-02 -7.80920682e-02  1.05535003e-02
 -8.49705076e-02  1.87800458e-01 -5.53305425e-02 -4.05296946e-02
 -1.68105655e-02 -9.64697267e-02 -1.00114054e-01 -1.25303984e-01
 -6.77861115e-02  1.38106300e-02  4.97948787e-02 -1.04414463e-01
  3.12147536e-03 -2.46650333e-02  1.56250756e-02 -3.41987984e-02
  2.90197738e-02 -1.30795750e-01  1.71425098e-01 -1.33199913e-01
 -4.35452619e-02 -1.52841321e-01  3.37717104e-02  2.11400042e-01
 -1.08493100e-01  6.64905827e-02  4.45687503e-02 -3.38898797e-03
  1.47302984e-01  3.10931848e-02  6.94873935e-03 -3.79090162e-02
  3.97055902e-02 -3.12563998e-02  2.99815273e-02 -9.30892230e-03
 -3.37192802e-02 -7.79667288e-02  4.20509297e-02  4.33535394e-02
 -2.38238094e-01 -4.11188300e-02 -1.93930088e-01  1.15012485e-01
 -2.14605373e+00  1.36975648e-01 -1.79026305e-01 -1.42630498e-01
 -1.37558424e-01 -1.55433436e-01 -6.96701214e-02  1.05328488e-01
  3.43486342e-02 -2.37676310e-03 -6.80980842e-02 -1.92470331e-01
  1.54727348e-01 -7.47455695e-02 -1.58054203e-02  3.33369549e-02
 -1.70510752e-01 -5.74331307e-02 -2.38994456e-01  5.64188931e-02
 -8.55051184e-02 -5.52984572e-02 -5.00408589e-02 -6.81572658e-02
  5.15848477e-03 -3.58487773e-02  7.00056842e-02  1.33127170e-01
  5.57938159e-02  1.03106840e-01  4.18598320e-02 -2.78162076e-03
  8.83131944e-02 -1.31482831e-01  1.34875022e-01 -8.31772344e-02
  1.62319378e-01  9.25839856e-02 -7.07548194e-02  1.74355644e-01
  1.53106818e-02 -1.74504449e-01 -5.39158255e-02 -1.16968555e-02
 -1.37824311e-01  1.07713713e-01  4.48548015e-02  1.07272158e-01
 -1.59084558e-01  1.94342786e-01 -4.73514319e-02 -4.87250503e-02
  2.82023483e-02 -4.18474756e-02  8.04397595e-02 -3.34005484e-02
 -1.00808502e-01 -1.15380334e-01  7.05894205e-02  2.92052920e-02
 -5.72604859e-02 -7.39274088e-03  1.44106517e-02 -2.64282237e-02
  2.31512689e-01  1.50161666e-01 -5.21462274e-02 -1.00796916e-02
 -4.47392305e-02  4.83958092e-02 -2.21927272e-01 -9.69846899e-02
 -5.91211767e-03  2.52508756e-01  1.08677704e-01  5.05047869e-02]

2. Vectorizing project_title

In [0]:
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed titles using tfidf vectorizer initialized above 
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
In [0]:
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
    # Sum of tf-idf values of all words in a particular project title
    cumulativeSumTfIdfWeightOfTitle = 0;
    # Tf-Idf weighted word2vec vector of a particular project title
    tfIdfWeightedWord2VecTitleVector = np.zeros(300);
    # Splitting title into list of words
    splittedTitle = title.split();
    # Iterating over each word
    for word in splittedTitle:
        # Checking if word is in glove words and set of words used by tfIdf title vectorizer
        if (word in gloveWords) and (word in tfIdfTitleWords):
            # Tf-Idf value of particular word in title
            tfIdfValueWord = tfIdfTitleDictionary[word] * (title.count(word) / len(splittedTitle));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfTitle != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
    # Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
    tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
In [0]:
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: 109248, 300
======================================================================
Sample Title: 
======================================================================
educational support english learners home
======================================================================
Tf-Idf Weighted Word2Vec vector of sample title: 
======================================================================
[-3.23904891e-02  5.58064810e-02  1.32666911e-01  3.84227573e-02
 -6.71984492e-02 -4.30940397e-01 -2.84607947e+00 -2.45905055e-01
  1.96794858e-01  3.19604663e-01 -6.12568872e-02  1.59218099e-01
  1.25129027e-01 -1.67580327e-01 -2.82644062e-01 -2.47555536e-01
  2.18304104e-01 -1.57431101e-01  7.66481545e-02 -1.61436633e-01
  2.38451267e-01  2.86712258e-01  2.70730890e-02 -9.74962294e-02
  1.67511144e-01  7.18131102e-02  1.82846112e-01 -1.96778087e-01
 -8.19948978e-02 -2.25877630e-01 -5.54573752e-01 -1.28462870e-01
  1.61012606e-01  2.94412658e-01 -1.63196910e-01 -1.23217523e-02
  1.37466355e-01  4.45437696e-02  4.65691769e-02 -1.17867965e-01
 -2.41502151e-03  2.24350668e-01 -2.51274676e-01  8.29431360e-02
 -1.65996673e-02 -2.47747576e-01  1.45110611e-03  2.37117949e-01
  9.71345150e-02 -5.13516477e-01 -1.40296688e-01  1.42775548e-01
  2.89949805e-01  6.49771690e-02 -3.41581088e-02 -1.58076306e-01
 -1.07731741e-01  7.59015357e-02 -1.21511682e-01 -1.16519972e-01
 -2.27321940e-01 -1.63525257e-01  1.80860125e-01 -4.17314689e-02
  4.60171896e-02  1.00024674e-01  1.54588362e-01  8.25394911e-02
  7.45768118e-02 -1.80240543e-02 -1.22956246e-01 -4.97450371e-03
 -8.06577406e-02 -5.00614538e-02 -2.15836210e-01 -5.89271531e-02
 -3.26363335e-01 -1.32706775e-01  1.61236199e-01 -1.25038790e-01
  1.96493846e-01 -4.95095193e-01  2.34765396e-01 -4.44646606e-02
 -2.04266125e-01 -3.21415735e-02  8.48111983e-02 -7.27603472e-02
  2.79183660e-01  1.18968262e-01 -7.43300594e-02  6.34587771e-02
  1.99863053e-01 -2.13382053e-01 -1.01221319e-01 -2.49884070e-01
 -2.92249478e+00  1.60273141e-01  7.74579728e-02  1.85323805e-01
 -1.33255909e-01 -2.00013519e-01 -1.31974722e-01 -2.62288530e-01
  3.54852941e-01  1.18537924e-01  1.62207829e-01  1.24436802e-01
  1.98867481e-01 -4.87526944e-03  3.00886908e-02  2.09330567e-01
  1.17189984e-01  3.94887340e-01  2.52941492e-02 -5.13348554e-02
 -2.91140828e-01  4.06939567e-01 -1.70319175e-01  1.17651155e-01
 -1.66813086e-01  1.53049826e-01  1.41255472e-01 -8.10785736e-02
  9.57549943e-02  2.73610111e-01  5.85622995e-02  7.91410001e-02
  1.47619459e-01  9.75521835e-02  6.74487028e-02  1.53125504e-01
  2.02791106e-02 -5.59403852e-02 -1.02109913e-01 -1.22913427e-01
  1.99873969e-01 -3.21872719e-01  1.38343165e-01  2.17196179e-01
  4.95201760e-03  8.52128333e-02 -1.45880901e-01 -2.10862397e-01
  1.20343357e-01  2.15598061e-01 -1.14038072e-02 -1.72172799e-01
  3.24157324e-01 -3.82818101e-01 -1.87580283e-01 -2.00827204e-01
 -1.41863370e-01  9.63016678e-02 -2.01659119e-01  6.74342164e-02
  7.12185747e-02 -1.04314039e-01 -9.08169483e-02 -1.63495605e-01
  9.68230169e-02 -5.01176209e-02 -8.34015616e-02 -1.88998660e-01
 -2.84065057e-01  1.16975197e-01 -2.80836800e-01 -9.33191327e-02
  3.79583269e-02 -1.22755412e-01  2.30408258e-01 -1.31968890e-01
  9.72824714e-03 -3.44272546e-01 -2.09522211e-03  2.45944018e-02
  2.94077607e-02  2.67568157e-01 -2.69460269e-02  1.25412311e-01
 -3.47031083e-01  2.09328241e-01  1.25385338e-02  1.55654760e-01
 -1.41368915e-01 -1.01749781e-01 -4.77312036e-04 -4.82325465e-02
 -7.15727478e-02 -3.63658602e-02 -4.33504397e-02 -2.71410315e-01
 -2.40079853e-01  2.01171435e-01  6.39005674e-02 -4.86787485e-02
 -1.48623863e-01 -1.72130906e-01  1.97761227e-01 -3.13043504e-01
  1.07772898e-01 -1.54518908e-01  1.31855435e-01  3.39703669e-01
 -4.51652340e-02 -4.05998340e-02  2.03610454e-01  8.84982054e-03
  3.05974297e-01  2.54736700e-01 -1.06925907e-01  1.27066655e-01
 -1.88835779e-02 -1.56632041e-01  8.45142200e-02  5.70681135e-02
  1.01119358e-02 -6.62387316e-03 -2.18552410e-01  1.20985419e-02
 -5.54006219e-01 -1.72367117e-01 -2.90325016e-01 -2.34816399e-01
 -1.94243114e+00  4.36715446e-01 -2.80713863e-01 -6.33991309e-03
 -2.90035778e-01 -1.98732349e-01  2.96737137e-02  1.50873684e-01
  1.16943997e-01  1.39741722e-01 -1.82238609e-01  4.09714520e-02
  2.37176600e-01 -1.24515116e-01  1.41648743e-01  2.64206287e-01
 -2.40551078e-01 -1.40415333e-01 -2.92432371e-01 -3.03761027e-02
 -9.90320454e-02 -8.43648662e-02  1.81116706e-01 -4.05719699e-01
  1.22898740e-01 -8.80109292e-02  1.09543672e-01  2.96110858e-01
  1.85027885e-01  9.14976115e-02  9.63416424e-03  5.50340717e-02
 -2.59328007e-02 -2.43942768e-01  2.54260096e-01 -1.03280950e-01
  1.56799018e-01 -9.58635926e-02 -4.31948365e-02  2.01228907e-01
  5.20033765e-02 -2.08030399e-01 -2.49149283e-01  3.11752465e-02
 -2.39410711e-01  2.54421815e-01  3.50420005e-02  1.31625993e-01
 -2.19027956e-01  2.75093693e-01  4.31276229e-02 -6.89266192e-02
  1.80694153e-01 -9.77254221e-02  6.52789959e-02  1.81468103e-01
 -5.79288980e-01 -1.91501478e-01 -1.43298895e-01  1.56769073e-02
 -2.28584041e-02 -7.96762354e-03  1.38764109e-01 -2.67804890e-02
  3.02808634e-01  1.63688874e-01 -1.98263925e-01  8.94007093e-02
 -2.01132765e-01  8.29230669e-03 -3.17426319e-01 -4.07929287e-02
 -1.63872993e-01  3.69860278e-01  2.90009047e-01  4.56005599e-02]

Vectorizing numerical features

1. Vectorizing price

In [0]:
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = StandardScaler();
priceScaler.fit(projectsData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(projectsData['price'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(projectsData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
Shape of standardized matrix of prices:  (109245, 1)
======================================================================
Sample original prices: 
======================================================================
[154.6  299.   516.85 232.9   67.98]
Sample standardized prices: 
======================================================================
[[-0.39052147]
 [ 0.00240752]
 [ 0.5952024 ]
 [-0.17745817]
 [-0.62622444]]

2. Vectorizing quantity

In [0]:
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = StandardScaler();
quantityScaler.fit(projectsData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(projectsData['quantity'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(projectsData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
Shape of standardized matrix of quantities:  (109245, 1)
======================================================================
Sample original quantities: 
======================================================================
[23  1 22  4  4]
Sample standardized quantities: 
======================================================================
[[ 0.23045805]
 [-0.6097785 ]
 [ 0.19226548]
 [-0.49520079]
 [-0.49520079]]

3. Vectorizing teacher_number_of_previously_posted_projects

In [0]:
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = StandardScaler();
previouslyPostedScaler.fit(projectsData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(projectsData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(projectsData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
Shape of standardized matrix of teacher_number_of_previously_posted_projects:  (109245, 1)
======================================================================
Sample original quantities: 
======================================================================
[0 7 1 4 1]
Sample standardized teacher_number_of_previously_posted_projects: 
======================================================================
[[-0.40153083]
 [-0.14952695]
 [-0.36553028]
 [-0.25752861]
 [-0.36553028]]

Taking 6k points(to avoid memory errors)

In [0]:
numberOfPoints = 6000;
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];

# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];
word2VecEssaysVectorsSub = word2VecEssaysVectors[0:numberOfPoints];
word2VecTitlesVectorsSub = word2VecTitlesVectors[0:numberOfPoints];
tfIdfWeightedWord2VecEssaysVectorsSub = tfIdfWeightedWord2VecEssaysVectors[0:numberOfPoints];
tfIdfWeightedWord2VecTitlesVectorsSub = tfIdfWeightedWord2VecTitlesVectors[0:numberOfPoints];

# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
In [0]:
classesDataSub = projectsData['project_is_approved'][0:numberOfPoints].values
In [0]:
classesDataSub.shape
Out[0]:
(6000,)

Data Visualization using T-SNE

Classification using data merged with bag of words vectorized title and all considered categorical, numerical features

In [0]:
bowTitleAndOthers = hstack((bowTitleModelSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
bowTitleAndOthers.shape
Out[0]:
(6000, 1875)
In [0]:
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    bowTitleAndOthersEmbedded = tsne.fit_transform(bowTitleAndOthers.toarray());
    bowTitleAndOthersTsneData = np.hstack((bowTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    bowTitleAndOthersTsneDataFrame = pd.DataFrame(bowTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of BoW Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(bowTitleAndOthersTsneDataFrame['Dimension1'], bowTitleAndOthersTsneDataFrame['Dimension2'], c = bowTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();

Classification using data merged with Tf-Idf vectorized title and all considered categorical, numerical features

In [0]:
tfIdfTitleAndOthers = hstack((tfIdfTitleModelSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
tfIdfTitleAndOthers.shape
Out[0]:
(6000, 1875)
In [0]:
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    tfIdfTitleAndOthersEmbedded = tsne.fit_transform(tfIdfTitleAndOthers.toarray());
    tfIdfTitleAndOthersTsneData = np.hstack((tfIdfTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    tfIdfTitleAndOthersTsneDataFrame = pd.DataFrame(tfIdfTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Tf-Idf Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(tfIdfTitleAndOthersTsneDataFrame['Dimension1'], tfIdfTitleAndOthersTsneDataFrame['Dimension2'], c = tfIdfTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();

Classification using data merged with Average Word2Vec vectorized title and all considered categorical, numerical features

In [0]:
word2VecTitleAndOthers = hstack((word2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
word2VecTitleAndOthers.shape
Out[0]:
(6000, 401)
In [0]:
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    word2VecTitleAndOthersEmbedded = tsne.fit_transform(word2VecTitleAndOthers.toarray());
    word2VecTitleAndOthersTsneData = np.hstack((word2VecTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    word2VecTitleAndOthersTsneDataFrame = pd.DataFrame(word2VecTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Average Word2Vec Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(word2VecTitleAndOthersTsneDataFrame['Dimension1'], word2VecTitleAndOthersTsneDataFrame['Dimension2'], c = word2VecTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();

Classification using data merged with Tf-idf Weighted Word2Vec vectorized title and all considered categorical, numerical features

In [0]:
tfIdfWeightedWord2VecTitleAndOthers = hstack((tfIdfWeightedWord2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
tfIdfWeightedWord2VecTitleAndOthers.shape
Out[0]:
(6000, 401)
In [0]:
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    tfIdfWeightedWord2VecTitleAndOthersEmbedded = tsne.fit_transform(tfIdfWeightedWord2VecTitleAndOthers.toarray());
    tfIdfWeightedWord2VecTitleAndOthersTsneData = np.hstack((tfIdfWeightedWord2VecTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame = pd.DataFrame(tfIdfWeightedWord2VecTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Tf-Idf Weighted Word2Vec Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Dimension1'], tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Dimension2'], c = tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();

Classification using data merged with all vectorizations of project_title and with all considered categorical, numerical features

In [0]:
allFeatures = hstack((bowTitleModelSub, tfIdfTitleModelSub, word2VecTitlesVectorsSub, tfIdfWeightedWord2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub))
print(allFeatures.shape)
(6000, 4249)
In [0]:
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    allFeaturesEmbedded = tsne.fit_transform(allFeatures.toarray());
    allFeaturesTsneData = np.hstack((allFeaturesEmbedded, classesDataSub.reshape(-1, 1)));
    allFeaturesTsneDataFrame = pd.DataFrame(allFeaturesTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of all vectorized features and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(allFeaturesTsneDataFrame['Dimension1'], allFeaturesTsneDataFrame['Dimension2'], c = allFeaturesTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();

Conclusion about data visualization using t-sne:

  1. Bag of Words, Tf-Idf are better than word2vec vectorizations because of forming some small group of clusters with less overlap of overall data when compared to others.
  2. Higher perplexity values seems better in data visualization because of less overlap of data than others.
  3. None of the techniques are useful for classification because of huge overlap of data.
  4. It is not seperable problem in 2-dimensions but it may be seperable in higher dimensions.

Classification & Modelling using support vector machine

Classification using data(orginal dimensions) by support vector machine

Splitting Data(Only training and test)

In [0]:
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
Out[0]:
(109245, 24)
In [0]:
classesData = projectsData['project_is_approved']
print(classesData.shape)
(109245,)
In [0]:
trainingData, testData, classesTraining, classesTest = model_selection.train_test_split(projectsData, classesData, test_size =  0.3, random_state = 0, stratify = classesData);
trainingData, crossValidateData, classesTraining, classesCrossValidate = model_selection.train_test_split(trainingData, classesTraining, test_size = 0.3, random_state = 0, stratify = classesTraining);
In [0]:
print("Shapes of splitted data: ");
equalsBorder(70);

print("testData shape: ", testData.shape);
print("classesTest: ", classesTest.shape);
print("trainingData shape: ", trainingData.shape);
print("classesTraining shape: ", classesTraining.shape);
Shapes of splitted data: 
======================================================================
testData shape:  (32774, 24)
classesTest:  (32774,)
trainingData shape:  (53529, 24)
classesTraining shape:  (53529,)
In [0]:
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape);
Number of negative points:  (8105, 24)
Number of positive points:  (45424, 24)
In [0]:
vectorizedFeatureNames = [];

Balancing Data

Note: Instead of displaying whole vectorization process for balanced and imbalanced data, we have simply disabled below cell while performing analysis on imbalanced data and enabled while performing analysis on balanced data
In [0]:
negativeData = trainingData[trainingData['project_is_approved'] == 0];
positiveData = trainingData[trainingData['project_is_approved'] == 1];
negativeDataBalanced = resample(negativeData, replace = True, n_samples = trainingData[trainingData['project_is_approved'] == 1].shape[0], random_state = 44);
trainingData = pd.concat([positiveData, negativeDataBalanced]);
trainingData = shuffle(trainingData);
classesTraining = trainingData['project_is_approved'];
print("Testing whether data is balanced: ");
equalsBorder(60);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape);
Testing whether data is balanced: 
============================================================
Number of positive points:  (45424, 24)
Number of negative points:  (45424, 24)

Vectorizing categorical data

1. Vectorizing cleaned_categories(project_subject_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(trainingData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(trainingData['cleaned_categories'].values);
In [0]:
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
Features used in vectorizing categories: 
======================================================================
['Warmth', 'Care_Hunger', 'History_Civics', 'Music_Arts', 'AppliedLearning', 'SpecialNeeds', 'Health_Sports', 'Math_Science', 'Literacy_Language']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (90848, 9)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 5)	1
  (0, 6)	1
  (1, 7)	1
  (1, 8)	1
  (2, 5)	1
  (3, 6)	1

2. Vectorizing cleaned_sub_categories(project_subject_sub_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(trainingData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(trainingData['cleaned_sub_categories'].values);
In [0]:
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
Features used in vectorizing subject sub categories: 
======================================================================
['Economics', 'CommunityService', 'FinancialLiteracy', 'ParentInvolvement', 'Extracurricular', 'Civics_Government', 'ForeignLanguages', 'NutritionEducation', 'Warmth', 'Care_Hunger', 'SocialSciences', 'PerformingArts', 'CharacterEducation', 'TeamSports', 'Other', 'College_CareerPrep', 'Music', 'History_Geography', 'Health_LifeScience', 'EarlyDevelopment', 'ESL', 'Gym_Fitness', 'EnvironmentalScience', 'VisualArts', 'Health_Wellness', 'AppliedSciences', 'SpecialNeeds', 'Literature_Writing', 'Mathematics', 'Literacy']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (90848, 30)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 21)	1
  (0, 26)	1
  (1, 18)	1
  (1, 29)	1
  (2, 26)	1
  (3, 21)	1
  (3, 24)	1

3. Vectorizing teacher_prefix - One Hot Encoding

In [0]:
def giveCounter(data):
    counter = Counter();
    for dataValue in data:
        counter.update(str(dataValue).split());
    return counter
In [0]:
giveCounter(trainingData['teacher_prefix'].values)
Out[0]:
Counter({'Dr': 7, 'Mr': 8836, 'Mrs': 46892, 'Ms': 32914, 'Teacher': 2199})
In [0]:
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
In [0]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Mr', 'Ms', 'Mrs', 'Teacher', 'Dr']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (90848, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 1)	1
  (3, 2)	1
  (4, 2)	1
  (5, 2)	1
  (6, 1)	1
  (7, 2)	1
  (8, 1)	1
  (9, 2)	1
  (10, 2)	1
  (11, 1)	1
  (12, 2)	1
  (13, 2)	1
  (14, 2)	1
  (15, 2)	1
  (16, 2)	1
  (17, 1)	1
  (18, 2)	1
  (19, 2)	1
  (20, 2)	1
  (21, 2)	1
  (22, 2)	1
  (23, 2)	1
  (24, 2)	1
  :	:
  (75, 1)	1
  (76, 2)	1
  (77, 2)	1
  (78, 2)	1
  (79, 2)	1
  (80, 1)	1
  (81, 1)	1
  (82, 2)	1
  (83, 2)	1
  (84, 2)	1
  (85, 2)	1
  (86, 1)	1
  (87, 1)	1
  (88, 2)	1
  (89, 1)	1
  (90, 2)	1
  (91, 1)	1
  (92, 2)	1
  (93, 1)	1
  (94, 2)	1
  (95, 0)	1
  (96, 1)	1
  (97, 2)	1
  (98, 2)	1
  (99, 1)	1
In [0]:
teacherPrefixes = [prefix.replace('.', '') for prefix in trainingData['teacher_prefix'].values];
teacherPrefixes[0:5]
Out[0]:
['Mr', 'Ms', 'Ms', 'Mrs', 'Mrs']
In [0]:
trainingData['teacher_prefix'] = teacherPrefixes;
trainingData.head(3)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity preprocessed_titles preprocessed_essays
89311 76792 p169473 1d4e226b6c530b8163df7687eb602b66 Mr IN 2017-02-02 22:25:47 Grades3to5 Health & Sports, Special Needs Gym & Fitness, Special Needs Bouncing Ball Braniacs II My students are amazing. They do amazing thing... They like to move, they love to read and love ... NaN NaN My students need movement while working indepe... 1 0 Health_Sports SpecialNeeds Gym_Fitness SpecialNeeds My students are amazing. They do amazing thing... 129.98 2 bouncing ball braniacs ii students amazing amazing things everyday resou...
53546 90827 p127392 e1b7051c65bac32eecac29c5282efce6 Ms NC 2017-02-20 09:18:02 GradesPreKto2 Math & Science, Literacy & Language Health & Life Science, Literacy STEM LEARNing I teach at a Title I school in Charlotte, Nort... My students need day to day interaction with b... NaN NaN My students need a variety of books and hands ... 0 0 Math_Science Literacy_Language Health_LifeScience Literacy I teach at a Title I school in Charlotte, Nort... 462.97 7 stem learning teach title school charlotte north carolina 10...
90710 148630 p235592 f4d8985b398e821c6b5b3990f072890e Ms NC 2016-09-26 16:28:40 GradesPreKto2 Special Needs Special Needs Reaching Deaf and Hard of Hearing Students thr... My students come from very diverse backgrounds... Using tablets will enable my deaf and hard of ... NaN NaN My students need tablets to accommodate their ... 0 1 SpecialNeeds SpecialNeeds My students come from very diverse backgrounds... 239.94 8 reaching deaf hard hearing students technology students come diverse backgrounds blue collar ...
In [0]:
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
In [0]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Mr', 'Ms', 'Mrs', 'Teacher', 'Dr']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (90848, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 1)	1
  (3, 2)	1

4. Vectorizing school_state - One Hot Encoding

In [0]:
schoolStateDictionary = dict(giveCounter(trainingData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(trainingData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(trainingData['school_state'].values);
In [0]:
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
Features used in vectorizing school_state: 
======================================================================
['IN', 'NC', 'VA', 'OK', 'TX', 'NY', 'MI', 'PA', 'TN', 'AK', 'ME', 'SC', 'MD', 'FL', 'LA', 'MO', 'MA', 'CT', 'NV', 'OH', 'AR', 'CA', 'GA', 'ID', 'AZ', 'AL', 'NJ', 'KY', 'MS', 'UT', 'WA', 'HI', 'IL', 'CO', 'WI', 'MN', 'NE', 'OR', 'MT', 'NM', 'KS', 'NH', 'SD', 'DE', 'DC', 'WV', 'RI', 'ND', 'IA', 'VT', 'WY']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (90848, 51)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 1)	1
  (3, 2)	1

5. Vectorizing project_grade_category - One Hot Encoding

In [0]:
giveCounter(trainingData['project_grade_category'])
Out[0]:
Counter({'Grades3to5': 30465,
         'Grades6to8': 14131,
         'Grades9to12': 9136,
         'GradesPreKto2': 37116})
In [0]:
cleanedGrades = []
for grade in trainingData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
Out[0]:
['Grades3to5', 'GradesPreKto2', 'GradesPreKto2', 'GradesPreKto2']
In [0]:
trainingData['project_grade_category'] = cleanedGrades
trainingData.head(4)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity preprocessed_titles preprocessed_essays
89311 76792 p169473 1d4e226b6c530b8163df7687eb602b66 Mr IN 2017-02-02 22:25:47 Grades3to5 Health & Sports, Special Needs Gym & Fitness, Special Needs Bouncing Ball Braniacs II My students are amazing. They do amazing thing... They like to move, they love to read and love ... NaN NaN My students need movement while working indepe... 1 0 Health_Sports SpecialNeeds Gym_Fitness SpecialNeeds My students are amazing. They do amazing thing... 129.98 2 bouncing ball braniacs ii students amazing amazing things everyday resou...
53546 90827 p127392 e1b7051c65bac32eecac29c5282efce6 Ms NC 2017-02-20 09:18:02 GradesPreKto2 Math & Science, Literacy & Language Health & Life Science, Literacy STEM LEARNing I teach at a Title I school in Charlotte, Nort... My students need day to day interaction with b... NaN NaN My students need a variety of books and hands ... 0 0 Math_Science Literacy_Language Health_LifeScience Literacy I teach at a Title I school in Charlotte, Nort... 462.97 7 stem learning teach title school charlotte north carolina 10...
90710 148630 p235592 f4d8985b398e821c6b5b3990f072890e Ms NC 2016-09-26 16:28:40 GradesPreKto2 Special Needs Special Needs Reaching Deaf and Hard of Hearing Students thr... My students come from very diverse backgrounds... Using tablets will enable my deaf and hard of ... NaN NaN My students need tablets to accommodate their ... 0 1 SpecialNeeds SpecialNeeds My students come from very diverse backgrounds... 239.94 8 reaching deaf hard hearing students technology students come diverse backgrounds blue collar ...
62408 74797 p026937 3bf94fbc0344a96a42edaf4de88b3de4 Mrs VA 2016-08-10 23:34:35 GradesPreKto2 Health & Sports Gym & Fitness, Health & Wellness Pedaling Our Way through K! We are a wonderful and energetic group of Kind... Let's think of a solution to this problem........ NaN NaN My students need elliptical trainers to keep t... 20 1 Health_Sports Gym_Fitness Health_Wellness We are a wonderful and energetic group of Kind... 99.97 6 pedaling way k wonderful energetic group kindergarten student...
In [0]:
projectGradeDictionary = dict(giveCounter(trainingData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(trainingData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(trainingData['project_grade_category'].values);
In [0]:
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
Features used in vectorizing project_grade_category: 
======================================================================
['Grades3to5', 'GradesPreKto2', 'Grades6to8', 'Grades9to12']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (90848, 4)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 1)	1
  (3, 1)	1

Vectorizing Text Data

In [0]:
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_essay']);
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_title']);


In [0]:
bagOfWordsVectorizedFeatures = [];

Bag of Words

1. Vectorizing project_essay

In [0]:
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10, max_features = 5000);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
In [0]:
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
Some of the Features used in vectorizing preprocessed essays: 
======================================================================
['worry', 'worrying', 'worse', 'worst', 'worth', 'worthy', 'would', 'wow', 'wrestling', 'write', 'writer', 'writers', 'writing', 'writings', 'written', 'wrong', 'wrote', 'xylophones', 'yard', 'year', 'yearbook', 'yearly', 'yearn', 'yearning', 'years', 'yes', 'yesterday', 'yet', 'yoga', 'york', 'young', 'younger', 'youngest', 'youth', 'youtube', 'zest', 'zip', 'zone', 'zones', 'zoo']
======================================================================
Shape of preprocessed essay matrix after vectorization:  (90848, 5000)
======================================================================
Sample bag-of-words vector of preprocessed essay: 
======================================================================
  (0, 101)	1
  (0, 3292)	1
  (0, 1017)	1
  (0, 1425)	1
  (0, 1640)	1
  (0, 2420)	1
  (0, 2249)	1
  (0, 4860)	1
  (0, 1094)	1
  (0, 1151)	1
  (0, 1696)	1
  (0, 741)	1
  (0, 2266)	1
  (0, 2718)	1
  (0, 2231)	1
  (0, 1029)	2
  (0, 92)	1
  (0, 2115)	1
  (0, 3450)	1
  (0, 1366)	1
  (0, 4838)	1
  (0, 2927)	1
  (0, 4699)	1
  (0, 3994)	1
  (0, 4752)	1
  :	:
  (0, 4089)	1
  (0, 3810)	1
  (0, 818)	2
  (0, 4496)	1
  (0, 2992)	1
  (0, 510)	1
  (0, 4966)	1
  (0, 2808)	3
  (0, 4979)	1
  (0, 4279)	1
  (0, 4439)	1
  (0, 4041)	1
  (0, 4751)	1
  (0, 990)	1
  (0, 2761)	2
  (0, 3720)	1
  (0, 1935)	2
  (0, 68)	1
  (0, 3960)	4
  (0, 2038)	1
  (0, 3823)	1
  (0, 1648)	1
  (0, 4545)	3
  (0, 250)	4
  (0, 4363)	4

2. Vectorizing project_title

In [0]:
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
In [0]:
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
Some of the Features used in vectorizing preprocessed titles: 
======================================================================
['work', 'workers', 'working', 'workout', 'works', 'worksheets', 'workshop', 'world', 'worlds', 'worldwide', 'worms', 'worth', 'would', 'wow', 'wrestling', 'write', 'writer', 'writers', 'writing', 'written', 'xylophone', 'ye', 'yeah', 'year', 'yearbook', 'yearbooks', 'years', 'yes', 'yet', 'yoga', 'yogi', 'yogis', 'young', 'youngest', 'youngsters', 'youth', 'youtube', 'zearn', 'zone', 'zoom']
======================================================================
Shape of preprocessed title matrix after vectorization:  (90848, 3095)
======================================================================
Sample bag-of-words vector of preprocessed title: 
======================================================================
  (0, 1407)	1
  (0, 210)	1
  (0, 317)	1

Tf-Idf Vectorization

1. Vectorizing project_essay

In [0]:
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10, max_features = 5000);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
In [0]:
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
Some of the Features used in tf-idf vectorizing preprocessed essays: 
======================================================================
['worry', 'worrying', 'worse', 'worst', 'worth', 'worthy', 'would', 'wow', 'wrestling', 'write', 'writer', 'writers', 'writing', 'writings', 'written', 'wrong', 'wrote', 'xylophones', 'yard', 'year', 'yearbook', 'yearly', 'yearn', 'yearning', 'years', 'yes', 'yesterday', 'yet', 'yoga', 'york', 'young', 'younger', 'youngest', 'youth', 'youtube', 'zest', 'zip', 'zone', 'zones', 'zoo']
======================================================================
Shape of preprocessed title matrix after tf-idf vectorization:  (90848, 5000)
======================================================================
Sample Tf-Idf vector of preprocessed essay: 
======================================================================
  (0, 4363)	0.09338285709699373
  (0, 250)	0.2774295147346398
  (0, 4545)	0.19030520390679861
  (0, 1648)	0.07506548898482561
  (0, 3823)	0.06298837695181407
  (0, 2038)	0.08186331143701778
  (0, 3960)	0.10793599522057999
  (0, 68)	0.12376130117509535
  (0, 1935)	0.1041457222313888
  (0, 3720)	0.06601195465336002
  (0, 2761)	0.11190173509172928
  (0, 990)	0.11251340346056221
  (0, 4751)	0.09313193950559152
  (0, 4041)	0.08884170218951208
  (0, 4439)	0.10230624797151606
  (0, 4279)	0.1053237317636152
  (0, 4979)	0.04633460626314362
  (0, 2808)	0.11067473614754454
  (0, 4966)	0.047931696918141184
  (0, 510)	0.07882165019949874
  (0, 2992)	0.08637617223966222
  (0, 4496)	0.05526737580063456
  (0, 818)	0.06532796001441747
  (0, 3810)	0.07406723261358955
  (0, 4089)	0.10732197989755879
  :	:
  (0, 4752)	0.06337984598967719
  (0, 3994)	0.06053709682350431
  (0, 4699)	0.10076461640088
  (0, 2927)	0.11233185215197106
  (0, 4838)	0.08749358178030474
  (0, 1366)	0.10434739819538644
  (0, 3450)	0.08393908715099366
  (0, 2115)	0.08703285038171124
  (0, 92)	0.04289495908057561
  (0, 1029)	0.20090475539219624
  (0, 2231)	0.0570569176784705
  (0, 2718)	0.06912546879648293
  (0, 2266)	0.07285305869243694
  (0, 741)	0.12306053644979925
  (0, 1696)	0.06145885846072847
  (0, 1151)	0.04422148352844308
  (0, 1094)	0.06892351584276195
  (0, 4860)	0.05408405656879022
  (0, 2249)	0.12372180762797393
  (0, 2420)	0.09126006287510087
  (0, 1640)	0.06278378089291439
  (0, 1425)	0.13878374007229743
  (0, 1017)	0.07746679604250473
  (0, 3292)	0.11213975579306085
  (0, 101)	0.0676216030686508

2. Vectorizing project_title

In [0]:
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
In [0]:
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
Some of the Features used in tf-idf vectorizing preprocessed titles: 
======================================================================
['work', 'workers', 'working', 'workout', 'works', 'worksheets', 'workshop', 'world', 'worlds', 'worldwide', 'worms', 'worth', 'would', 'wow', 'wrestling', 'write', 'writer', 'writers', 'writing', 'written', 'xylophone', 'ye', 'yeah', 'year', 'yearbook', 'yearbooks', 'years', 'yes', 'yet', 'yoga', 'yogi', 'yogis', 'young', 'youngest', 'youngsters', 'youth', 'youtube', 'zearn', 'zone', 'zoom']
======================================================================
Shape of preprocessed title matrix after tf-idf vectorization:  (90848, 3095)
======================================================================
Sample Tf-Idf vector of preprocessed title: 
======================================================================
  (0, 317)	0.5902446059690166
  (0, 210)	0.5571317262160165
  (0, 1407)	0.5841365805768732

Average Word2Vector Vectorization

In [0]:
# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# We should have glove_vectors file for creating below model
with open('drive/My Drive/glove_vectors', 'rb') as f:
    gloveModel = pickle.load(f)
    gloveWords =  set(gloveModel.keys())
In [0]:
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
Glove vector of sample word: 
======================================================================
[-0.26078   -0.36898   -0.022831   0.21666    0.16672   -0.20268
 -3.1219     0.33057    0.71512    0.28874    0.074368  -0.033203
  0.23783    0.21052    0.076562   0.13007   -0.31706   -0.45888
 -0.45463   -0.13191    0.49761    0.072704   0.16811    0.18846
 -0.16688   -0.21973    0.08575   -0.19577   -0.2101    -0.32436
 -0.56336    0.077996  -0.22758   -0.66569    0.14824    0.038945
  0.50881   -0.1352     0.49966   -0.4401    -0.022335  -0.22744
  0.22086    0.21865    0.36647    0.30495   -0.16565    0.038759
  0.28108   -0.2167     0.12453    0.65401    0.34584   -0.2557
 -0.046363  -0.31111   -0.020936  -0.17122   -0.77114    0.29289
 -0.14625    0.39541   -0.078938   0.051127   0.15076    0.085126
  0.183     -0.06755    0.26312    0.0087276  0.0066415  0.37033
  0.03496   -0.12627   -0.052626  -0.34897    0.14672    0.14799
 -0.21821   -0.042785   0.2661    -1.1105     0.31789    0.27278
  0.054468  -0.27458    0.42732   -0.44101   -0.19302   -0.32948
  0.61501   -0.22301   -0.36354   -0.34983   -0.16125   -0.17195
 -3.363      0.45146   -0.13753    0.31107    0.2061     0.33063
  0.45879    0.24256    0.042342   0.074837  -0.12869    0.12066
  0.42843   -0.4704    -0.18937    0.32685    0.26079    0.20518
 -0.18432   -0.47658    0.69193    0.18731   -0.12516    0.35447
 -0.1969    -0.58981   -0.88914    0.5176     0.13177   -0.078557
  0.032963  -0.19411    0.15109    0.10547   -0.1113    -0.61533
  0.0948    -0.3393    -0.20071   -0.30197    0.29531    0.28017
  0.16049    0.25294   -0.44266   -0.39412    0.13486    0.25178
 -0.044114   1.1519     0.32234   -0.34323   -0.10713   -0.15616
  0.031206   0.46636   -0.52761   -0.39296   -0.068424  -0.04072
  0.41508   -0.34564    0.71001   -0.364      0.2996     0.032281
  0.34035    0.23452    0.78342    0.48045   -0.1609     0.40102
 -0.071795  -0.16531    0.082153   0.52065    0.24194    0.17113
  0.33552   -0.15725   -0.38984    0.59337   -0.19388   -0.39864
 -0.47901    1.0835     0.24473    0.41309    0.64952    0.46846
  0.024386  -0.72087   -0.095061   0.10095   -0.025229   0.29435
 -0.57696    0.53166   -0.0058338 -0.3304     0.19661   -0.085206
  0.34225    0.56262    0.19924   -0.027111  -0.44567    0.17266
  0.20887   -0.40702    0.63954    0.50708   -0.31862   -0.39602
 -0.1714    -0.040006  -0.45077   -0.32482   -0.0316     0.54908
 -0.1121     0.12951   -0.33577   -0.52768   -0.44592   -0.45388
  0.66145    0.33023   -1.9089     0.5318     0.21626   -0.13152
  0.48258    0.68028   -0.84115   -0.51165    0.40017    0.17233
 -0.033749   0.045275   0.37398   -0.18252    0.19877    0.1511
  0.029803   0.16657   -0.12987   -0.50489    0.55311   -0.22504
  0.13085   -0.78459    0.36481   -0.27472    0.031805   0.53052
 -0.20078    0.46392   -0.63554    0.040289  -0.19142   -0.0097011
  0.068084  -0.10602    0.25567    0.096125  -0.10046    0.15016
 -0.26733   -0.26494    0.057888   0.062678  -0.11596    0.28115
  0.25375   -0.17954    0.20615    0.24189    0.062696   0.27719
 -0.42601   -0.28619   -0.44697   -0.082253  -0.73415   -0.20675
 -0.60289   -0.06728    0.15666   -0.042614   0.41368   -0.17367
 -0.54012    0.23883    0.23075    0.13608   -0.058634  -0.089705
  0.18469    0.023634   0.16178    0.23384    0.24267    0.091846 ]
======================================================================
Shape of glove vector:  (300,)
In [0]:
def getWord2VecVectors(texts):
    word2VecTextsVectors = [];
    for preProcessedText in tqdm(texts):
        word2VecTextVector = np.zeros(300);
        numberOfWordsInText = 0;
        for word in preProcessedText.split():
            if word in gloveWords:
                word2VecTextVector += gloveModel[word];
                numberOfWordsInText += 1;
        if numberOfWordsInText != 0:
            word2VecTextVector = word2VecTextVector / numberOfWordsInText;
        word2VecTextsVectors.append(word2VecTextVector);
    return word2VecTextsVectors;

1. Vectorizing project_essay

In [0]:
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);

In [0]:
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
Shape of Word2Vec vectorization matrix of essays: 90848,300
======================================================================
Sample essay: 
======================================================================
students amazing amazing things everyday resources given school 70 free reduced lunch considered urban setting school system started school year noticing many students would benefit movement technology classroom research shows ball chairs exercise balls stand chairs help students focus fire brains asking amazing things help amazing learners lot energy funny students great leaders successful future donations make dreams become reality like move love read love lots positive attention many raised single parent households receive free lunch based socioeconomic status things may prevent getting ahead early life may not provide life experiences many us see typical minute walk door classroom focus potential growth may not able control home lives however certainly control experience school day creative positive way hopeful inspire even earliest learners continue path academic excellencenannan
======================================================================
Word2Vec vector of sample essay: 
======================================================================
[-2.30561648e-02  6.40555880e-02 -2.40005832e-02 -7.53136424e-02
 -2.89873520e-02 -3.43800600e-03 -3.19359600e+00  1.09771573e-01
 -1.69693360e-02 -5.44965360e-02 -1.02372448e-02  1.86768408e-02
  4.56463536e-02 -8.24328872e-02 -5.41905336e-02 -1.23099008e-02
 -4.60638872e-02 -2.47778680e-02 -1.30629664e-02  2.02156984e-03
  3.68665201e-02 -8.62291440e-03  3.37056347e-02  1.09497976e-02
  3.74181920e-03  1.61854960e-02  9.24549840e-02 -1.03988968e-02
 -9.63670640e-03  6.28680400e-03 -2.91992595e-01 -6.18385760e-02
  4.26785712e-02  9.60849920e-02 -7.11235898e-02 -2.83650080e-02
 -2.16746848e-02 -4.28636093e-02 -4.74191680e-03 -6.39047296e-02
 -5.49670560e-02  6.77416544e-02  4.30427361e-02 -1.16874275e-01
 -4.20485440e-03  1.61846000e-03  9.38261952e-02 -3.12237331e-02
 -4.38814176e-02 -1.07803781e-01 -1.58663416e-03 -1.78526288e-02
 -5.89043120e-03  1.43535240e-02  4.48685928e-02 -7.89809676e-02
  6.26832832e-02 -7.89059794e-02 -2.10129696e-02  1.10805277e-01
 -5.63471050e-02 -6.51035401e-02  7.31532278e-02  2.81239752e-03
 -9.78070760e-02  7.04880480e-02  2.71226056e-03 -3.89966736e-02
  1.62376142e-01 -6.62020112e-02 -9.76644472e-02  3.43955040e-03
  1.53187568e-03 -7.91365456e-02 -4.34214056e-02 -1.37325956e-01
  6.02649992e-02  5.14981368e-02  5.04272960e-02 -3.57419888e-02
  5.03336626e-02 -4.32656600e-01  2.10347152e-03  7.17743480e-03
 -1.55604236e-01 -1.50561570e-02  1.26340065e-01 -5.81456600e-02
  1.20705928e-01 -2.07978010e-02  7.03270279e-02  7.56759040e-03
 -2.57828960e-03  8.47251520e-03 -7.80047760e-03 -1.77558762e-01
 -2.19540514e+00  4.82423532e-02  1.52136446e-01  8.78327968e-02
 -1.29850021e-01  7.59469808e-02  1.15844322e-01 -6.45294400e-02
  5.71785784e-02 -1.73344856e-02 -2.92811312e-02 -2.12482850e-01
  9.19962640e-02  2.00639024e-02 -1.09403960e-02 -5.74172326e-02
  3.37093866e-02  1.93825118e-01  1.09938080e-02  1.18557226e-01
 -2.30364320e-01  3.11792728e-02  8.67331832e-02  3.90003148e-02
  3.69016544e-02  2.34576439e-02 -1.17421538e-02 -1.27508019e-01
 -1.83127408e-02 -1.02537360e-03  7.89146944e-02 -7.43879000e-02
 -3.65652312e-02  2.48802912e-02 -2.33221600e-04 -1.34758072e-02
 -2.64235576e-02 -3.44704472e-02 -4.18029448e-02 -1.07624961e-01
  5.59795440e-02 -7.24954850e-02  3.59956400e-02  2.79742526e-01
  5.40363512e-02  2.92429376e-02 -1.03688376e-02 -5.00297544e-02
 -3.71440720e-02 -4.56431888e-03  5.41354320e-03  1.49070464e-02
  1.69523369e-01  1.76588080e-02 -2.96603760e-02  2.40781184e-02
  4.53353224e-02 -1.62569592e-02  3.41673280e-02  7.61677568e-02
  3.86184048e-02 -4.89776280e-02 -4.10505088e-02 -2.07796560e-03
  5.20050976e-02 -3.29153968e-02 -5.57145752e-02 -6.00489400e-02
 -6.35427206e-02  7.53815344e-02 -1.03974395e-01  4.19362358e-02
  1.25156019e-01 -9.56814448e-02 -6.58241024e-02  3.49378184e-02
 -7.95565768e-02 -8.60128880e-02 -5.93250608e-02  2.64589336e-02
 -1.37934817e-02  1.48758656e-02 -1.13018622e-01 -6.42359019e-02
  3.47599736e-02  2.59894612e-01  5.65672752e-02 -5.03248226e-03
 -6.14266128e-02 -2.25997664e-02 -8.59891984e-02 -2.71916032e-02
  9.35608416e-02  1.45400880e-03 -1.17596408e-02 -5.06474440e-02
 -7.79679752e-02  1.22193800e-02  2.42334496e-02 -7.85081568e-02
 -4.90978648e-02  5.59902616e-02  4.44772528e-02  4.50372368e-02
  7.75557656e-02  1.60503864e-02 -2.24388336e-02  5.72164784e-02
 -8.83505536e-02  5.44015440e-02  6.63738336e-02 -6.32723200e-02
  1.15537367e-01  1.13188952e-02 -1.34292928e-02 -1.31710664e-02
 -1.43260992e-02 -1.61068632e-01 -5.78670154e-02 -1.48517360e-02
 -8.34769384e-02 -4.39446240e-02 -5.24271520e-03  1.73721192e-02
 -9.32646448e-02 -4.41995975e-02 -3.85715704e-02 -8.88375856e-02
 -2.22703928e+00  4.73601264e-02  3.55251208e-02  7.02042080e-02
 -2.76063928e-02 -1.52534304e-01  6.03085920e-03 -4.08157240e-02
 -6.18830000e-02 -9.66357008e-03 -7.07969280e-02  7.63101440e-02
  1.07335282e-01 -7.36330120e-02  3.61103400e-02  1.03569039e-01
 -2.27049216e-02  4.10729216e-02 -2.27608837e-01  1.16083577e-01
 -4.34220501e-02  1.42304440e-02 -5.54313360e-02 -3.05713768e-02
  2.09292022e-02  2.40047295e-02  2.16410216e-02  7.90204656e-02
  3.72523829e-02 -4.97593632e-02  8.66129056e-02  2.33321808e-02
  1.10064716e-01 -7.26371744e-02  7.27992976e-02 -6.33331496e-02
  1.99350800e-02  1.57000535e-02  1.76601360e-02  3.73326512e-02
  3.62064640e-02 -4.36296440e-02 -9.85322576e-02  1.95353920e-02
  6.02981130e-02  7.30323270e-02 -5.41691472e-02 -1.82366480e-03
 -1.44784791e-01  3.12362158e-02 -6.23063792e-02 -1.13728720e-02
  6.26857136e-02 -3.18937200e-03 -3.09574901e-02  6.16588696e-02
  2.64560561e-01 -5.16752928e-02 -7.37153712e-02  8.10807040e-03
 -5.90479504e-03  1.22212421e-01  2.24339544e-02  5.47889752e-03
 -1.66582392e-02 -5.04148400e-02  2.13114560e-02 -5.27879170e-02
 -6.98875520e-02 -1.27442425e-01  1.05724815e-01  1.58399552e-02
 -3.70167400e-02  1.38915078e-01  9.25899056e-02 -1.05696656e-02]

2. Vectorizing project_title

In [0]:
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);

In [0]:
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
Shape of Word2Vec vectorization matrix of project titles: 90848, 300
======================================================================
Sample title: 
======================================================================
bouncing ball braniacs ii
======================================================================
Word2Vec vector of sample title: 
======================================================================
[-0.336635   -0.00915675  0.2817075  -0.253605    0.23995    -0.201835
 -1.1381575  -0.12155827 -0.3352425  -0.2551125  -0.05294975 -0.05051225
  0.06772    -0.4771125   0.0299145  -0.16241325 -0.288105   -0.02172525
  0.1714875  -0.33404475 -0.0170445  -0.05812525  0.3059125   0.14346175
 -0.09921075  0.05695075  0.2249     -0.12116585  0.0907555   0.2512975
  0.1359625  -0.118853   -0.15462175 -0.40275625  0.11802987  0.13101375
  0.0551175  -0.10964025  0.24963375  0.120858    0.09476372  0.357815
 -0.13900925 -0.0199465  -0.0190066  -0.08472718  0.25498683  0.04681
 -0.1376975   0.18744925  0.1478311   0.11250575 -0.03912875  0.02810875
 -0.240922    0.0569575  -0.06094625 -0.0507445   0.20964275 -0.090239
  0.255575    0.1990775  -0.10106275 -0.02469325  0.15722975  0.2252925
 -0.221467    0.15501925  0.13416875  0.22607475  0.00205275 -0.00539
  0.0124175   0.04912587  0.08578225 -0.396367    0.2370925  -0.1300975
  0.132495   -0.18061922 -0.1904815  -0.049554   -0.025284   -0.03622329
 -0.047465    0.0505305   0.09982375  0.055319   -0.016105   -0.1189475
  0.0978925  -0.04650325  0.0658365  -0.2832125  -0.0707      0.17901993
 -1.0873125   0.27816025 -0.0532725   0.1016915  -0.1937825  -0.1044975
  0.02139125  0.05308125 -0.07052925 -0.13363825 -0.11320325 -0.1560875
  0.04652925  0.02762475 -0.1170825   0.01717425  0.021497    0.23254202
  0.28497795  0.08337175 -0.2179425  -0.31576675 -0.07227075  0.14477875
 -0.15355     0.146728   -0.08660675 -0.07808525  0.02034275 -0.09680318
 -0.1098725   0.152295    0.02324575 -0.15756075 -0.1852825  -0.1213465
  0.07850525 -0.3827375  -0.04937735  0.015855   -0.18583175 -0.0200255
  0.020918    0.4250625   0.01725     0.105586    0.34282715  0.1830965
  0.24212842  0.1557355  -0.13976425 -0.2166955  -0.2809825   0.10823625
  0.1079      0.4368375   0.10729625 -0.115455    0.2276325   0.11790375
 -0.1756775   0.006184   -0.04537    -0.0847895  -0.062923    0.169907
 -0.28834125 -0.0200445  -0.170235    0.191244   -0.11915     0.29131775
  0.05920275 -0.0755495  -0.31788    -0.089691   -0.0286655   0.18999325
  0.060341   -0.0315455  -0.06606675 -0.09516025  0.0581825  -0.10036025
 -0.088235   -0.223241    0.057286   -0.032605    0.344745    0.0616
  0.0775385  -0.07548125 -0.03460675 -0.04629475 -0.07872922  0.146759
 -0.38942275 -0.04719275  0.23074575  0.22002     0.07387525 -0.0739985
 -0.2441005  -0.0265825   0.0700324  -0.0850145   0.07054    -0.02152
 -0.24060325  0.174163   -0.3568475  -0.06513225 -0.019285   -0.084817
  0.0653375  -0.1257075  -0.36335    -0.2098375  -0.02569425 -0.0912125
 -0.2598      0.11244075  0.25192775 -0.0678888  -0.00534    -0.0061775
 -0.1054775   0.18570772 -1.78391    -0.225705   -0.09464453  0.063056
 -0.1218375   0.0524425  -0.08587325 -0.18246     0.0997305   0.2388295
 -0.1937365   0.2008675   0.2983255  -0.07110032 -0.01008425 -0.1466315
  0.09383082 -0.015793    0.189475    0.27019    -0.10693775  0.29375375
  0.24669725  0.2202475  -0.20541275  0.0486875   0.019217    0.07415122
 -0.1262765  -0.1589375   0.121433    0.0390025   0.0027515  -0.03277575
  0.0691705  -0.09088325 -0.031395   -0.00799475  0.0261845   0.15528275
  0.04237225 -0.14738125 -0.00269775 -0.0147715  -0.1896475   0.3068
  0.197056   -0.01734025  0.0759      0.01409525  0.267163    0.03872525
  0.050848    0.49742    -0.1180865  -0.2010925   0.6211725   0.083475
  0.10527    -0.0809625  -0.23594375 -0.3302115   0.201759    0.37104375
 -0.0145975   0.10602625 -0.1659505   0.11503475 -0.1888775   0.1167845
 -0.108985    0.07096575  0.2850408   0.0345215  -0.0926825   0.1234415 ]

Tf-Idf Weighted Word2Vec Vectorization

1. Vectorizing project_essay

In [0]:
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed essays using tfidf vectorizer initialized above 
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
In [0]:
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
    # Sum of tf-idf values of all words in a particular essay
    cumulativeSumTfIdfWeightOfEssay = 0;
    # Tf-Idf weighted word2vec vector of a particular essay
    tfIdfWeightedWord2VecEssayVector = np.zeros(300);
    # Splitting essay into list of words
    splittedEssay = essay.split();
    # Iterating over each word
    for word in splittedEssay:
        # Checking if word is in glove words and set of words used by tfIdf essay vectorizer
        if (word in gloveWords) and (word in tfIdfEssayWords):
            # Tf-Idf value of particular word in essay
            tfIdfValueWord = tfIdfEssayDictionary[word] * (essay.count(word) / len(splittedEssay));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfEssay != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
    # Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
    tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);

In [0]:
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: 90848, 300
======================================================================
Sample Essay: 
======================================================================
students amazing amazing things everyday resources given school 70 free reduced lunch considered urban setting school system started school year noticing many students would benefit movement technology classroom research shows ball chairs exercise balls stand chairs help students focus fire brains asking amazing things help amazing learners lot energy funny students great leaders successful future donations make dreams become reality like move love read love lots positive attention many raised single parent households receive free lunch based socioeconomic status things may prevent getting ahead early life may not provide life experiences many us see typical minute walk door classroom focus potential growth may not able control home lives however certainly control experience school day creative positive way hopeful inspire even earliest learners continue path academic excellencenannan
======================================================================
Tf-Idf Weighted Word2Vec vector of sample essay: 
======================================================================
[-4.40281688e-02  6.26231737e-02 -4.31555427e-02 -9.01313639e-02
 -9.62350308e-03  8.03547035e-03 -3.15334366e+00  7.25903700e-02
 -9.98054048e-03 -1.18859295e-01 -2.45466881e-02  1.44637528e-02
  2.89848365e-02 -5.93626889e-02 -5.07538231e-02 -3.11436166e-03
 -9.08761211e-02 -5.31242877e-02 -3.95432651e-02  8.46004829e-03
  2.92311866e-02 -4.00495981e-02  5.22486912e-02  2.90640636e-02
  3.68616621e-03  1.42196541e-02  8.48053943e-02  1.98116967e-02
 -2.49067822e-02  3.42957045e-02 -2.52961631e-01 -5.62297509e-02
  2.10839853e-02  1.13911638e-01 -8.26764751e-02 -5.20411413e-02
 -2.61075517e-02 -8.56502451e-02  1.94286110e-02 -5.25268614e-02
 -5.17782497e-02  7.54000155e-02  2.56516224e-02 -1.09924693e-01
 -5.80767889e-03  3.13340126e-02  1.12273257e-01 -5.85698577e-02
 -6.67183961e-02 -8.61779454e-02  1.46897325e-03 -2.99907248e-02
 -2.98344860e-02  1.87979372e-02  2.71321320e-02 -8.77036994e-02
  5.18968909e-02 -1.23114304e-01  1.53663261e-02  1.08213648e-01
 -3.60242954e-02 -5.26911536e-02  5.87031362e-02  3.01983827e-02
 -1.01431900e-01  5.26983966e-02 -2.12726748e-02 -2.13978809e-02
  1.46360156e-01 -6.53763717e-02 -1.11788514e-01  1.55399380e-02
 -3.06463490e-02 -5.58739504e-02 -5.38017421e-02 -1.42917478e-01
  6.64018235e-02  6.83613885e-02  5.74426694e-02 -2.89263287e-02
  5.48084728e-02 -4.10828385e-01 -7.67436191e-03  1.33136452e-02
 -1.60595372e-01 -2.07965772e-02  1.19963511e-01 -3.85662426e-02
  1.03421571e-01 -6.21170794e-03  4.85478842e-02  8.39264787e-03
 -8.29186192e-03  2.54003840e-02 -2.78796931e-03 -1.48967065e-01
 -2.14742233e+00  5.25321330e-02  1.33588681e-01  6.27028973e-02
 -1.42402853e-01  6.74121801e-02  1.06276785e-01 -4.14299323e-02
  4.98283248e-02 -2.58041371e-02 -5.36767766e-02 -2.16199688e-01
  1.01340475e-01  6.96892344e-03 -2.67976451e-02 -9.23479044e-02
  2.17060985e-03  1.92591578e-01 -1.36234523e-02  1.49586762e-01
 -1.97199388e-01  2.58054548e-02  1.16870458e-01  3.88316061e-02
  3.50575619e-02  9.80400391e-03  3.06604664e-03 -1.29823762e-01
 -1.61527228e-02 -2.40926266e-02  6.35628369e-02 -8.97177748e-02
 -2.14193962e-02 -1.09432426e-02 -4.17332175e-02 -1.17778235e-02
 -4.87681058e-02 -4.41529110e-02 -6.53294779e-02 -1.05218403e-01
  3.41600440e-02 -8.81314681e-02  4.59597181e-02  3.10088803e-01
  6.84948550e-02  4.46874787e-02  2.30420851e-02 -7.97831090e-02
 -5.73619736e-02 -9.27882612e-03 -2.66266266e-02 -2.70094416e-03
  1.70163622e-01  6.20201119e-03 -2.02551992e-02  2.15471458e-02
  2.54672226e-02  1.18770876e-02  3.74622035e-02  8.40084393e-02
  1.97109231e-02 -4.67433638e-02 -2.69084109e-02  8.79960875e-03
  6.05426665e-02 -3.62210377e-02 -5.69850492e-02 -6.91683775e-02
 -4.84090817e-02  8.63630329e-02 -1.26502080e-01  4.40166921e-02
  1.40494072e-01 -1.03122556e-01 -7.35388733e-02  4.32693327e-02
 -9.95960465e-02 -9.06054261e-02 -3.46452005e-02  4.06514348e-02
 -2.64261783e-02 -3.57373465e-03 -1.07031700e-01 -9.32448850e-02
  4.36484452e-02  2.59717570e-01  7.11142779e-02  4.00558653e-03
 -4.51284593e-02  3.41746698e-02 -1.00908879e-01 -3.13855416e-02
  1.06449467e-01 -6.42907424e-04 -7.17312223e-03 -6.29258104e-02
 -8.59902472e-02  1.46183764e-03  2.94782923e-02 -3.38247912e-02
 -4.16268847e-02  5.32674252e-02  6.06680653e-02  6.42535837e-02
  5.96934813e-02  2.34943783e-02 -3.81674840e-02  4.47627441e-02
 -1.20017026e-01  7.24445454e-02  5.87884029e-02 -4.55980632e-02
  1.06341574e-01  2.78584176e-02  5.97945234e-03 -1.34716866e-02
 -2.55738842e-02 -1.46009602e-01 -4.16689615e-02 -2.33456898e-02
 -9.17006723e-02 -5.68391838e-02  1.04959393e-02  5.99694333e-03
 -6.84332103e-02 -4.63802874e-02 -2.51816004e-02 -8.50933453e-02
 -2.28976310e+00  1.53719499e-02  4.58609780e-02  9.65742605e-02
 -2.71001266e-02 -1.34752499e-01  1.40109336e-03 -5.88471443e-02
 -5.82388787e-02 -1.38346461e-02 -5.21083688e-02  7.00019347e-02
  1.02983216e-01 -1.00402393e-01  1.53362919e-02  9.67473772e-02
  5.78491211e-03  4.09104789e-02 -2.06881723e-01  1.22575568e-01
 -6.15251345e-02  5.84710079e-03 -6.03300346e-02  2.06089847e-02
  1.23557690e-02  4.30673813e-02  3.66308866e-02  7.01205427e-02
  4.16755603e-02 -7.18802367e-02  1.04419991e-01  5.20683620e-02
  9.94579785e-02 -7.66246356e-02  4.79323892e-02 -8.78930460e-02
  6.93868778e-03  3.71651310e-02  3.67972893e-02  1.69981642e-02
  6.34906152e-02 -4.32034223e-02 -1.02230739e-01  1.15465541e-02
  4.83474966e-02  8.80579456e-02 -6.64411462e-02  1.36157625e-03
 -1.39409242e-01  3.65293619e-02 -8.31160167e-02 -4.18185631e-02
  7.17291805e-02  1.41634733e-02 -4.65136155e-02  5.89629799e-02
  3.24561724e-01 -3.83872532e-02 -5.24027604e-02  1.15254166e-02
 -1.27966572e-02  1.17431511e-01  1.66115222e-02 -1.40325357e-02
 -2.48278867e-02 -7.44909106e-02 -4.56127493e-03 -7.23180617e-02
 -6.73544058e-02 -1.34927062e-01  1.35740982e-01  2.25520288e-02
 -3.60779768e-02  1.13646073e-01  8.46059289e-02 -1.61942043e-02]

2. Vectorizing project_title

In [0]:
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed titles using tfidf vectorizer initialized above 
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
In [0]:
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
    # Sum of tf-idf values of all words in a particular project title
    cumulativeSumTfIdfWeightOfTitle = 0;
    # Tf-Idf weighted word2vec vector of a particular project title
    tfIdfWeightedWord2VecTitleVector = np.zeros(300);
    # Splitting title into list of words
    splittedTitle = title.split();
    # Iterating over each word
    for word in splittedTitle:
        # Checking if word is in glove words and set of words used by tfIdf title vectorizer
        if (word in gloveWords) and (word in tfIdfTitleWords):
            # Tf-Idf value of particular word in title
            tfIdfValueWord = tfIdfTitleDictionary[word] * (title.count(word) / len(splittedTitle));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfTitle != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
    # Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
    tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);

In [0]:
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: 90848, 300
======================================================================
Sample Title: 
======================================================================
bouncing ball braniacs ii
======================================================================
Tf-Idf Weighted Word2Vec vector of sample title: 
======================================================================
[-3.28926824e-01 -3.81643207e-02  2.69988561e-01 -2.06591199e-01
  2.14990622e-01 -1.70872725e-01 -9.84207210e-01 -1.56817811e-01
 -2.93495883e-01 -2.14411205e-01 -5.97118101e-02 -6.44405734e-02
  7.62356327e-02 -4.55620737e-01  2.23450864e-02 -1.53325103e-01
 -2.71154491e-01 -3.04669959e-02  1.86821166e-01 -3.03270067e-01
 -1.45008580e-02 -6.12075437e-02  2.78461339e-01  1.35452494e-01
 -9.86145160e-02  5.44403248e-02  1.91640931e-01 -1.33227230e-01
  7.28124805e-02  2.42683234e-01  1.00821333e-01 -1.04393514e-01
 -1.62555960e-01 -3.69859739e-01  1.27330121e-01  1.60038767e-01
  5.40554629e-02 -1.12613848e-01  2.49907544e-01  9.39900445e-02
  8.57757906e-02  3.01162973e-01 -1.18822075e-01 -2.55977075e-02
 -2.58304813e-02 -1.03824786e-01  2.29574422e-01  6.92130139e-02
 -1.05409216e-01  2.29934475e-01  1.58916103e-01  8.35545053e-02
 -1.87401706e-02  5.24536828e-02 -2.22774259e-01  4.34785227e-02
 -5.96266472e-02 -6.66790662e-02  1.91447757e-01 -6.13988346e-02
  2.69675403e-01  1.96594955e-01 -7.23132236e-02 -4.29107751e-02
  1.67430985e-01  2.28200701e-01 -2.17306095e-01  1.29857909e-01
  1.26679865e-01  2.01644733e-01  9.48328162e-03 -2.86797786e-02
  2.65517043e-02  4.92548946e-02  9.54604893e-02 -3.50088690e-01
  2.43593985e-01 -1.31371327e-01  1.69991385e-01 -1.44493940e-01
 -1.48432593e-01 -2.67116143e-02 -4.00062903e-02 -5.65822099e-02
 -1.02176202e-02  5.04066189e-02  6.44624543e-02  3.24855651e-02
 -2.16677940e-02 -5.45253576e-02  1.02274754e-01 -6.94188761e-02
  6.40785706e-02 -2.70147220e-01 -4.58441365e-02  1.68025966e-01
 -9.19938849e-01  2.55115721e-01 -4.55811521e-02  9.19315439e-02
 -1.80948739e-01 -9.70284253e-02  1.32114679e-02  5.70553152e-02
 -1.10585083e-01 -1.19928057e-01 -8.51049467e-02 -1.85523746e-01
  6.29128222e-02  2.97679620e-02 -8.01009944e-02  2.25932905e-02
  2.59497326e-02  2.10218153e-01  2.67292378e-01  7.33515349e-02
 -2.19194682e-01 -2.80822323e-01 -7.29874594e-02  1.16249852e-01
 -1.12797883e-01  1.27045495e-01 -9.65135115e-02 -6.96650041e-02
 -1.32956118e-02 -8.66943492e-02 -8.75162659e-02  1.47267996e-01
  1.12531427e-02 -1.54050637e-01 -2.05408265e-01 -1.34087400e-01
  8.25264937e-02 -3.67975675e-01 -4.22357069e-02 -1.57287261e-03
 -1.62509846e-01 -3.53289135e-02  1.98558289e-02  3.62575521e-01
  2.96110241e-02  1.00351351e-01  3.27983083e-01  1.57503574e-01
  2.17037042e-01  1.31340911e-01 -1.74360233e-01 -2.18516644e-01
 -2.57795102e-01  9.62793523e-02  1.24336716e-01  4.19819833e-01
  8.89854415e-02 -1.04766969e-01  2.17617393e-01  1.00821930e-01
 -1.41601857e-01 -1.95270393e-02 -1.47555171e-02 -7.99524905e-02
 -6.06611318e-02  1.48119808e-01 -2.65089202e-01  1.12528509e-02
 -1.71804659e-01  1.73278826e-01 -8.88518493e-02  2.92491823e-01
  2.08325964e-02 -8.37870511e-02 -2.96998188e-01 -5.70108848e-02
 -1.31566889e-02  1.54750349e-01  7.71824692e-02 -3.62250886e-02
 -5.27062649e-02 -8.35544418e-02  4.91279422e-02 -9.52110369e-02
 -5.80743420e-02 -2.04558765e-01  6.36335373e-02 -2.51735508e-03
  2.85705242e-01  4.27654703e-02  8.14198281e-02 -5.55946613e-02
 -3.28487228e-03 -4.85994974e-02 -7.47081870e-02  1.27909209e-01
 -4.17077408e-01 -3.25363690e-02  2.30231701e-01  2.19628576e-01
  5.01627397e-02 -6.54387896e-02 -2.34835327e-01 -1.12894263e-02
  8.39524651e-02 -8.17526557e-02  9.39511284e-02  1.71031536e-03
 -2.30946220e-01  1.41026701e-01 -3.78727169e-01 -4.52433335e-02
 -1.73850196e-02 -9.65146351e-02  6.92539402e-02 -1.25630126e-01
 -3.20151032e-01 -2.18387371e-01 -1.59081700e-02 -1.09366095e-01
 -2.48393836e-01  9.64961150e-02  2.52275744e-01 -7.98590149e-02
  2.72208297e-02 -3.42908461e-02 -8.55390829e-02  1.85743948e-01
 -1.54811713e+00 -1.82484595e-01 -7.99826824e-02  5.44064100e-02
 -1.34222011e-01  7.25705361e-03 -1.09551359e-01 -1.92770238e-01
  1.15294951e-01  2.25122494e-01 -2.00939977e-01  2.01506829e-01
  2.75177469e-01 -8.35940896e-02  2.31542614e-02 -1.32512316e-01
  1.04670037e-01 -2.89770702e-02  1.83352953e-01  2.33693337e-01
 -1.00432853e-01  2.82729876e-01  2.18245892e-01  2.39864700e-01
 -1.69431986e-01  2.48638599e-02  3.31126041e-02  7.92332653e-02
 -1.06651590e-01 -1.72737295e-01  1.49791747e-01  3.46148666e-02
 -2.35612886e-03 -1.24593763e-02  7.13918040e-02 -1.17488150e-01
 -5.07229221e-02  1.18700891e-03  3.80940529e-02  1.90553309e-01
  1.95583354e-02 -1.30117106e-01 -5.51304125e-03 -1.12565820e-02
 -2.05628941e-01  2.92417430e-01  1.79992009e-01 -2.26730336e-02
  5.35481857e-02  2.12067463e-02  2.53027211e-01  4.30345477e-02
  4.49364129e-02  4.73489927e-01 -1.15395393e-01 -1.99714113e-01
  5.44589259e-01  1.23026535e-01  9.00187979e-02 -5.89941851e-02
 -2.51463037e-01 -3.05086057e-01  2.25405511e-01  3.32462702e-01
 -3.02860677e-02  8.73037358e-02 -1.80244090e-01  1.03970184e-01
 -1.52043748e-01  1.12873638e-01 -9.66281911e-02  5.78059274e-02
  2.59570014e-01  1.96500406e-02 -7.45492993e-02  1.18793861e-01]

Method for vectorizing unknown essays using our training data tf-idf weighted model

In [0]:
def getAvgTfIdfEssayVectors(arrayOfTexts):
    # Creating list to save tf-idf weighted vectors of essays
    tfIdfWeightedWord2VecEssaysVectors = [];
    # Iterating over each essay
    for essay in tqdm(arrayOfTexts):
        # Sum of tf-idf values of all words in a particular essay
        cumulativeSumTfIdfWeightOfEssay = 0;
        # Tf-Idf weighted word2vec vector of a particular essay
        tfIdfWeightedWord2VecEssayVector = np.zeros(300);
        # Splitting essay into list of words
        splittedEssay = essay.split();
        # Iterating over each word
        for word in splittedEssay:
            # Checking if word is in glove words and set of words used by tfIdf essay vectorizer
            if (word in gloveWords) and (word in tfIdfEssayWords):
                # Tf-Idf value of particular word in essay
                tfIdfValueWord = tfIdfEssayDictionary[word] * (essay.count(word) / len(splittedEssay));
                # Making tf-idf weighted word2vec
                tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
                # Summing tf-idf weight of word to cumulative sum
                cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
        if cumulativeSumTfIdfWeightOfEssay != 0:
            # Taking average of sum of vectors with tf-idf cumulative sum
            tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
        # Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
        tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
    return tfIdfWeightedWord2VecEssaysVectors;

Method for vectorizing unknown titles using our training data tf-idf weighted model

In [0]:
def getAvgTfIdfTitleVectors(arrayOfTexts):
    # Creating list to save tf-idf weighted vectors of project titles
    tfIdfWeightedWord2VecTitlesVectors = [];
    # Iterating over each title
    for title in tqdm(arrayOfTexts):
        # Sum of tf-idf values of all words in a particular project title
        cumulativeSumTfIdfWeightOfTitle = 0;
        # Tf-Idf weighted word2vec vector of a particular project title
        tfIdfWeightedWord2VecTitleVector = np.zeros(300);
        # Splitting title into list of words
        splittedTitle = title.split();
        # Iterating over each word
        for word in splittedTitle:
            # Checking if word is in glove words and set of words used by tfIdf title vectorizer
            if (word in gloveWords) and (word in tfIdfTitleWords):
                # Tf-Idf value of particular word in title
                tfIdfValueWord = tfIdfTitleDictionary[word] * (title.count(word) / len(splittedTitle));
                # Making tf-idf weighted word2vec
                tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
                # Summing tf-idf weight of word to cumulative sum
                cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
        if cumulativeSumTfIdfWeightOfTitle != 0:
            # Taking average of sum of vectors with tf-idf cumulative sum
            tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
        # Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
        tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
    return tfIdfWeightedWord2VecTitlesVectors;

Vectorizing numerical features

1. Vectorizing price

In [0]:
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = MinMaxScaler();
priceScaler.fit(trainingData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(trainingData['price'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(trainingData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
Shape of standardized matrix of prices:  (90848, 1)
======================================================================
Sample original prices: 
======================================================================
[129.98 462.97 239.94  99.97 927.49]
Sample standardized prices: 
======================================================================
[[0.01293415]
 [0.04623868]
 [0.02393197]
 [0.00993265]
 [0.09269839]]

2. Vectorizing quantity

In [0]:
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = MinMaxScaler();
quantityScaler.fit(trainingData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(trainingData['quantity'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
Shape of standardized matrix of quantities:  (90848, 1)
======================================================================
Sample original quantities: 
======================================================================
[2 7 8 6 6]
Sample standardized quantities: 
======================================================================
[[0.00107643]
 [0.00645856]
 [0.00753498]
 [0.00538213]
 [0.00538213]]

3. Vectorizing teacher_number_of_previously_posted_projects

In [0]:
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = MinMaxScaler();
previouslyPostedScaler.fit(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
In [0]:
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
Shape of standardized matrix of teacher_number_of_previously_posted_projects:  (90848, 1)
======================================================================
Sample original quantities: 
======================================================================
[ 1  0  0 20 13]
Sample standardized teacher_number_of_previously_posted_projects: 
======================================================================
[[0.00221729]
 [0.        ]
 [0.        ]
 [0.0443459 ]
 [0.02882483]]
In [0]:
numberOfPoints = previouslyPostedStandardized.shape[0];
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];

# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];

# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];

# Classes
classesTrainingSub = classesTraining;
In [0]:
supportVectorMachineResultsDataFrame = pd.DataFrame(columns =  ['Vectorizer', 'Model', 'Hyper Parameter - alpha', 'AUC', 'Data']);
supportVectorMachineResultsDataFrame
Out[0]:
Vectorizer Model Hyper Parameter - alpha AUC Data

Preparing cross validate data for analysis

In [0]:
# Test data categorical features transformation 
categoriesTransformedCrossValidateData = subjectsCategoriesVectorizer.transform(crossValidateData['cleaned_categories']);
subCategoriesTransformedCrossValidateData = subjectsSubCategoriesVectorizer.transform(crossValidateData['cleaned_sub_categories']);
teacherPrefixTransformedCrossValidateData = teacherPrefixVectorizer.transform(crossValidateData['teacher_prefix']);
schoolStateTransformedCrossValidateData = schoolStateVectorizer.transform(crossValidateData['school_state']);
projectGradeTransformedCrossValidateData = projectGradeVectorizer.transform(crossValidateData['project_grade_category']);

# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_title'])[1];
bowEssayTransformedCrossValidateData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedCrossValidateData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedCrossValidateData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedCrossValidateData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
avgWord2VecEssayTransformedCrossValidateData = getWord2VecVectors(preProcessedEssaysTemp);
avgWord2VecTitleTransformedCrossValidateData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedWord2VecEssayTransformedCrossValidateData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedWord2VecTitleTransformedCrossValidateData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);


# Test data numerical features transformation
priceTransformedCrossValidateData = priceScaler.transform(crossValidateData['price'].values.reshape(-1, 1));
quantityTransformedCrossValidateData = quantityScaler.transform(crossValidateData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedCrossValidateData = previouslyPostedScaler.transform(crossValidateData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));






Preparing Test data for analysis

In [0]:
# Test data categorical features transformation 
categoriesTransformedTestData = subjectsCategoriesVectorizer.transform(testData['cleaned_categories']);
subCategoriesTransformedTestData = subjectsSubCategoriesVectorizer.transform(testData['cleaned_sub_categories']);
teacherPrefixTransformedTestData = teacherPrefixVectorizer.transform(testData['teacher_prefix']);
schoolStateTransformedTestData = schoolStateVectorizer.transform(testData['school_state']);
projectGradeTransformedTestData = projectGradeVectorizer.transform(testData['project_grade_category']);

# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(testData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(testData['project_title'])[1];
bowEssayTransformedTestData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedTestData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedTestData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedTestData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
avgWord2VecEssayTransformedTestData = getWord2VecVectors(preProcessedEssaysTemp);
avgWord2VecTitleTransformedTestData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedWord2VecEssayTransformedTestData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedWord2VecTitleTransformedTestData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);

# Test data numerical features transformation
priceTransformedTestData = priceScaler.transform(testData['price'].values.reshape(-1, 1));
quantityTransformedTestData = quantityScaler.transform(testData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedTestData = previouslyPostedScaler.transform(testData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));






Classification using imbalanced data by support vector machine

In [0]:
techniques = ['Bag of words', 'Tf-Idf', 'Average Word2Vector', 'Tf-Idf Weighted Word2Vector'];
for index, technique in enumerate(techniques):
    trainingMergedData = hstack((categoriesVectorsSub,\
                                     subCategoriesVectorsSub,\
                                     teacherPrefixVectorsSub,\
                                     schoolStateVectorsSub,\
                                     projectGradeVectorsSub,\
                                     priceStandardizedSub,\
                                     previouslyPostedStandardizedSub));
    crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
                                          subCategoriesTransformedCrossValidateData,\
                                          teacherPrefixTransformedCrossValidateData,\
                                          schoolStateTransformedCrossValidateData,\
                                          projectGradeTransformedCrossValidateData,\
                                          priceTransformedCrossValidateData,\
                                          previouslyPostedTransformedCrossValidateData));
    testMergedData = hstack((categoriesTransformedTestData,\
                                          subCategoriesTransformedTestData,\
                                          teacherPrefixTransformedTestData,\
                                          schoolStateTransformedTestData,\
                                          projectGradeTransformedTestData,\
                                          priceTransformedTestData,\
                                          previouslyPostedTransformedTestData));
    if(index == 0):
        trainingMergedData = hstack((trainingMergedData,\
                                     bowTitleModelSub,\
                                     bowEssayModelSub));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 bowTitleTransformedCrossValidateData,\
                                 bowEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 bowTitleTransformedTestData,\
                                 bowEssayTransformedTestData));
    elif(index == 1):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfTitleModelSub,\
                                     tfIdfEssayModelSub));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 tfIdfTitleTransformedCrossValidateData,\
                                 tfIdfEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 tfIdfTitleTransformedTestData,\
                                 tfIdfEssayTransformedTestData));
    elif(index == 2):
        trainingMergedData = hstack((trainingMergedData,\
                                     word2VecTitlesVectors,\
                                     word2VecEssaysVectors));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 avgWord2VecTitleTransformedCrossValidateData,\
                                 avgWord2VecEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 avgWord2VecTitleTransformedTestData,\
                                 avgWord2VecEssayTransformedTestData));
    elif(index == 3):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfWeightedWord2VecTitlesVectors,\
                                     tfIdfWeightedWord2VecEssaysVectors));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 tfIdfWeightedWord2VecTitleTransformedCrossValidateData,\
                                 tfIdfWeightedWord2VecEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 tfIdfWeightedWord2VecTitleTransformedTestData,\
                                 tfIdfWeightedWord2VecEssayTransformedTestData));
    
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge');
    tunedParameters = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
    classifier = GridSearchCV(svmClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
    classifier.fit(trainingMergedData, classesTrainingSub);
    
    trainingAucMeanValues = classifier.cv_results_['mean_train_score'];
    trainingAucStdValues = classifier.cv_results_['std_train_score'];
    crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
    crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
    
    plt.plot(tunedParameters['alpha'], trainingAucMeanValues, 'b', label = "Training AUC");
    plt.plot(tunedParameters['alpha'], crossValidateAucMeanValues, label = "Cross Validate AUC");
    plt.scatter(tunedParameters['alpha'], trainingAucMeanValues, label = 'Training AUC values');
    plt.scatter(tunedParameters['alpha'], crossValidateAucMeanValues, label = ['Cross validate AUC values']);
    plt.gca().fill_between(tunedParameters['alpha'], trainingAucMeanValues - trainingAucStdValues, trainingAucMeanValues + trainingAucStdValues, alpha = 0.2, color = 'darkblue');
    plt.gca().fill_between(tunedParameters['alpha'], crossValidateAucMeanValues - crossValidateAucStdValues, crossValidateAucMeanValues + crossValidateAucStdValues, alpha = 0.2, color = 'darkorange');
    plt.xlabel('Hyper parameter: alpha values');
    plt.ylabel('Scoring: AUC values');
    plt.grid();
    plt.legend();
    plt.show();
    
    optimalHypParamValue = classifier.best_params_['alpha'];
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge', alpha = optimalHypParamValue);
    svmClassifier.fit(trainingMergedData, classesTrainingSub);
    predScoresTraining = svmClassifier.predict(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predScoresTraining);
    predScoresTest = svmClassifier.predict(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest);
    
    plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
    plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.xlabel("fpr values");
    plt.ylabel("tpr values");
    plt.grid();
    plt.legend();
    plt.show();
    
    areaUnderRocValueTest = auc(fprTest, tprTest);
    
    print("Results of analysis using {} vectorized text features merged with other features using support vector machine classifier: ".format(technique));
    equalsBorder(70);
    print("AUC values of train data: ");
    equalsBorder(40);
    print(trainingAucMeanValues);
    equalsBorder(40);
    print("Optimal Hyper parameter Value: ", optimalHypParamValue);
    equalsBorder(40);
    print("AUC value of test data: ", str(areaUnderRocValueTest));
    # Predicting classes of test data projects
    predictionClassesTest = svmClassifier.predict(testMergedData);
    equalsBorder(40);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="YlGnBu");
    plt.show();
    # Adding results to results dataframe
    supportVectorMachineResultsDataFrame = supportVectorMachineResultsDataFrame.append({'Vectorizer': technique, 'Model': 'SVM(SGD - hinge loss)', 'Hyper Parameter - alpha': optimalHypParamValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
Results of analysis using Bag of words vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.76005059 0.76745118 0.67453951 0.59326364 0.56023526 0.55369242
 0.55353489]
========================================
Optimal Hyper parameter Value:  0.01
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================
Results of analysis using Tf-Idf vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.78754953 0.75817599 0.53398948 0.53304417 0.53049886 0.53049896
 0.53049893]
========================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5049873975548939
========================================
Confusion Matrix : 
============================================================
Results of analysis using Average Word2Vector vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.6726658  0.61639933 0.51775173 0.47374112 0.47275041 0.47514971
 0.47514966]
========================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5822205894218779
========================================
Confusion Matrix : 
============================================================
Results of analysis using Tf-Idf Weighted Word2Vector vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.67526927 0.63545661 0.53822968 0.47581433 0.4684103  0.47027674
 0.47027681]
========================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5821593255989109
========================================
Confusion Matrix : 
============================================================

Classification using balanced data by support vector machine

In [0]:
techniques = ['Bag of words', 'Tf-Idf', 'Average Word2Vector', 'Tf-Idf Weighted Word2Vector'];
for index, technique in enumerate(techniques):
    trainingMergedData = hstack((categoriesVectorsSub,\
                                     subCategoriesVectorsSub,\
                                     teacherPrefixVectorsSub,\
                                     schoolStateVectorsSub,\
                                     projectGradeVectorsSub,\
                                     priceStandardizedSub,\
                                     previouslyPostedStandardizedSub));
    crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
                                          subCategoriesTransformedCrossValidateData,\
                                          teacherPrefixTransformedCrossValidateData,\
                                          schoolStateTransformedCrossValidateData,\
                                          projectGradeTransformedCrossValidateData,\
                                          priceTransformedCrossValidateData,\
                                          previouslyPostedTransformedCrossValidateData));
    testMergedData = hstack((categoriesTransformedTestData,\
                                          subCategoriesTransformedTestData,\
                                          teacherPrefixTransformedTestData,\
                                          schoolStateTransformedTestData,\
                                          projectGradeTransformedTestData,\
                                          priceTransformedTestData,\
                                          previouslyPostedTransformedTestData));
    if(index == 0):
        trainingMergedData = hstack((trainingMergedData,\
                                     bowTitleModelSub,\
                                     bowEssayModelSub));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 bowTitleTransformedCrossValidateData,\
                                 bowEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 bowTitleTransformedTestData,\
                                 bowEssayTransformedTestData));
    elif(index == 1):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfTitleModelSub,\
                                     tfIdfEssayModelSub));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 tfIdfTitleTransformedCrossValidateData,\
                                 tfIdfEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 tfIdfTitleTransformedTestData,\
                                 tfIdfEssayTransformedTestData));
    elif(index == 2):
        trainingMergedData = hstack((trainingMergedData,\
                                     word2VecTitlesVectors,\
                                     word2VecEssaysVectors));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 avgWord2VecTitleTransformedCrossValidateData,\
                                 avgWord2VecEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 avgWord2VecTitleTransformedTestData,\
                                 avgWord2VecEssayTransformedTestData));
    elif(index == 3):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfWeightedWord2VecTitlesVectors,\
                                     tfIdfWeightedWord2VecEssaysVectors));
        crossValidateMergedData = hstack((crossValidateMergedData,\
                                 tfIdfWeightedWord2VecTitleTransformedCrossValidateData,\
                                 tfIdfWeightedWord2VecEssayTransformedCrossValidateData));
        testMergedData = hstack((testMergedData,\
                                 tfIdfWeightedWord2VecTitleTransformedTestData,\
                                 tfIdfWeightedWord2VecEssayTransformedTestData));
    
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge');
    tunedParameters = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
    classifier = GridSearchCV(svmClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
    classifier.fit(trainingMergedData, classesTrainingSub);
    
    trainingAucMeanValues = classifier.cv_results_['mean_train_score'];
    trainingAucStdValues = classifier.cv_results_['std_train_score'];
    crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
    crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
    
    plt.plot(tunedParameters['alpha'], trainingAucMeanValues, 'b', label = "Training AUC");
    plt.plot(tunedParameters['alpha'], crossValidateAucMeanValues, label = "Cross Validate AUC");
    plt.scatter(tunedParameters['alpha'], trainingAucMeanValues, label = 'Training AUC values');
    plt.scatter(tunedParameters['alpha'], crossValidateAucMeanValues, label = ['Cross validate AUC values']);
    plt.gca().fill_between(tunedParameters['alpha'], trainingAucMeanValues - trainingAucStdValues, trainingAucMeanValues + trainingAucStdValues, alpha = 0.2, color = 'darkblue');
    plt.gca().fill_between(tunedParameters['alpha'], crossValidateAucMeanValues - crossValidateAucStdValues, crossValidateAucMeanValues + crossValidateAucStdValues, alpha = 0.2, color = 'darkorange');
    plt.xlabel('Hyper parameter: alpha values');
    plt.ylabel('Scoring: AUC values');
    plt.grid();
    plt.legend();
    plt.show();
    
    optimalHypParamValue = classifier.best_params_['alpha'];
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge', alpha = optimalHypParamValue);
    svmClassifier.fit(trainingMergedData, classesTrainingSub);
    predScoresTraining = svmClassifier.predict(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predScoresTraining);
    predScoresTest = svmClassifier.predict(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest);
    
    plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
    plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.xlabel("fpr values");
    plt.ylabel("tpr values");
    plt.grid();
    plt.legend();
    plt.show();
    
    areaUnderRocValueTest = auc(fprTest, tprTest);
    
    print("Results of analysis using {} vectorized text features merged with other features using support vector machine classifier: ".format(technique));
    equalsBorder(70);
    print("AUC values of train data: ");
    equalsBorder(40);
    print(trainingAucMeanValues);
    equalsBorder(40);
    print("Optimal Hyper parameter Value: ", optimalHypParamValue);
    equalsBorder(40);
    print("AUC value of test data: ", str(areaUnderRocValueTest));
    # Predicting classes of test data projects
    predictionClassesTest = svmClassifier.predict(testMergedData);
    equalsBorder(40);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="YlGnBu");
    plt.show();
    # Adding results to results dataframe
    supportVectorMachineResultsDataFrame = supportVectorMachineResultsDataFrame.append({'Vectorizer': technique, 'Model': 'SVM(SGD - hinge loss)', 'Hyper Parameter - alpha': optimalHypParamValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
Results of analysis using Bag of words vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.7990519  0.79822896 0.72636534 0.6646498  0.61644424 0.61631539
 0.61631694]
========================================
Optimal Hyper parameter Value:  0.01
========================================
AUC value of test data:  0.6599121188716968
========================================
Confusion Matrix : 
============================================================
Results of analysis using Tf-Idf vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.82481272 0.61018014 0.56713458 0.56687523 0.5668752  0.56687553
 0.56687553]
========================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.644182556252285
========================================
Confusion Matrix : 
============================================================
Results of analysis using Average Word2Vector vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.70362207 0.67508357 0.61620389 0.60466162 0.60464408 0.60464378
 0.60464493]
========================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5968317829816302
========================================
Confusion Matrix : 
============================================================
Results of analysis using Tf-Idf Weighted Word2Vector vectorized text features merged with other features using support vector machine classifier: 
======================================================================
AUC values of train data: 
========================================
[0.69795679 0.697785   0.63788102 0.62214486 0.62211589 0.62211558
 0.62211539]
========================================
Optimal Hyper parameter Value:  0.01
========================================
AUC value of test data:  0.6249861611211157
========================================
Confusion Matrix : 
============================================================

Classification using data with reduced dimensions by support vector machine

In [121]:
projectsData.shape
Out[121]:
(109245, 30)

Calculating number of words of title and essay

In [0]:
number_of_words_in_title = [len(title.split()) for title in projectsData['preprocessed_titles'].values]
number_of_words_in_essay = [len(essay.split()) for essay in projectsData['preprocessed_essays'].values]
projectsData['number_of_words_in_title'] = number_of_words_in_title;
projectsData['number_of_words_in_essay'] = number_of_words_in_essay;

Calculating sentiment score of each essay

In [123]:
sentimentAnalyzer = SentimentIntensityAnalyzer();
positiveSentimentScores = [];
negativeSentimentScores = [];
neutralSentimentScores = [];
compoundSentimentScores = [];
for projectEssay in tqdm(projectsData['preprocessed_essays'].values):
  sentimentScore = sentimentAnalyzer.polarity_scores(projectEssay);
  positiveSentimentScores.append(sentimentScore['pos']);
  negativeSentimentScores.append(sentimentScore['neg']);
  neutralSentimentScores.append(sentimentScore['neu']);
  compoundSentimentScores.append(sentimentScore['compound']);
print(len(positiveSentimentScores), len(negativeSentimentScores), len(neutralSentimentScores), len(compoundSentimentScores));
print(positiveSentimentScores[0:5])
109245 109245 109245 109245
[0.154, 0.305, 0.23, 0.256, 0.151]
In [124]:
projectsData['positive_sentiment_score'] = positiveSentimentScores;
projectsData['negative_sentiment_score'] = negativeSentimentScores;
projectsData['neutral_sentiment_score'] = neutralSentimentScores;
projectsData['compound_sentiment_score'] = compoundSentimentScores;
projectsData.shape
Out[124]:
(109245, 30)

Splitting Data(Only training and test)

In [125]:
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
Out[125]:
(109245, 30)
In [126]:
classesData = projectsData['project_is_approved']
print(classesData.shape)
(109245,)
In [0]:
trainingData, testData, classesTraining, classesTest = model_selection.train_test_split(projectsData, classesData, test_size =  0.3, random_state = 0, stratify = classesData);
trainingData, crossValidateData, classesTraining, classesCrossValidate = model_selection.train_test_split(trainingData, classesTraining, test_size = 0.3, random_state = 0, stratify = classesTraining);
In [128]:
print("Shapes of splitted data: ");
equalsBorder(70);

print("testData shape: ", testData.shape);
print("classesTest: ", classesTest.shape);
print("trainingData shape: ", trainingData.shape);
print("classesTraining shape: ", classesTraining.shape);
Shapes of splitted data: 
======================================================================
testData shape:  (32774, 30)
classesTest:  (32774,)
trainingData shape:  (53529, 30)
classesTraining shape:  (53529,)
In [129]:
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape);
Number of negative points:  (8105, 30)
Number of positive points:  (45424, 30)
In [0]:
vectorizedFeatureNames = [];

Balancing Data

Note: Instead of displaying whole vectorization process for balanced and imbalanced data, we have simply disabled below cell while performing analysis on imbalanced data and enabled while performing analysis on balanced data
In [167]:
negativeData = trainingData[trainingData['project_is_approved'] == 0];
positiveData = trainingData[trainingData['project_is_approved'] == 1];
negativeDataBalanced = resample(negativeData, replace = True, n_samples = trainingData[trainingData['project_is_approved'] == 1].shape[0], random_state = 44);
trainingData = pd.concat([positiveData, negativeDataBalanced]);
trainingData = shuffle(trainingData);
classesTraining = trainingData['project_is_approved'];
print("Testing whether data is balanced: ");
equalsBorder(60);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape);
Testing whether data is balanced: 
============================================================
Number of positive points:  (45424, 30)
Number of negative points:  (45424, 30)

Vectorizing categorical data

1. Vectorizing cleaned_categories(project_subject_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(trainingData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(trainingData['cleaned_categories'].values);
In [169]:
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
Features used in vectorizing categories: 
======================================================================
['Warmth', 'Care_Hunger', 'History_Civics', 'Music_Arts', 'AppliedLearning', 'SpecialNeeds', 'Health_Sports', 'Math_Science', 'Literacy_Language']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (90848, 9)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 8)	1
  (1, 3)	1
  (1, 7)	1
  (2, 7)	1
  (2, 8)	1
  (3, 3)	1

2. Vectorizing cleaned_sub_categories(project_subject_sub_categories cleaned) - One Hot Encoding

In [0]:
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(trainingData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(trainingData['cleaned_sub_categories'].values);
In [171]:
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
Features used in vectorizing subject sub categories: 
======================================================================
['Economics', 'CommunityService', 'FinancialLiteracy', 'ParentInvolvement', 'Extracurricular', 'Civics_Government', 'ForeignLanguages', 'NutritionEducation', 'Warmth', 'Care_Hunger', 'SocialSciences', 'PerformingArts', 'CharacterEducation', 'TeamSports', 'Other', 'College_CareerPrep', 'Music', 'History_Geography', 'Health_LifeScience', 'EarlyDevelopment', 'ESL', 'Gym_Fitness', 'EnvironmentalScience', 'VisualArts', 'Health_Wellness', 'AppliedSciences', 'SpecialNeeds', 'Literature_Writing', 'Mathematics', 'Literacy']
======================================================================
Shape of cleaned_categories matrix after vectorization(one-hot-encoding):  (90848, 30)
======================================================================
Sample vectors of categories: 
======================================================================
  (0, 29)	1
  (1, 23)	1
  (1, 25)	1
  (2, 22)	1
  (2, 29)	1
  (3, 23)	1

3. Vectorizing teacher_prefix - One Hot Encoding

In [0]:
def giveCounter(data):
    counter = Counter();
    for dataValue in data:
        counter.update(str(dataValue).split());
    return counter
In [173]:
giveCounter(trainingData['teacher_prefix'].values)
Out[173]:
Counter({'Dr': 7, 'Mr': 8836, 'Mrs': 46892, 'Ms': 32914, 'Teacher': 2199})
In [0]:
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
In [175]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Ms', 'Mr', 'Mrs', 'Teacher', 'Dr']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (90848, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 2)	1
  (4, 0)	1
  (5, 0)	1
  (6, 1)	1
  (7, 0)	1
  (8, 2)	1
  (9, 2)	1
  (10, 0)	1
  (11, 1)	1
  (12, 0)	1
  (13, 2)	1
  (14, 1)	1
  (15, 2)	1
  (16, 0)	1
  (17, 0)	1
  (18, 0)	1
  (19, 0)	1
  (20, 0)	1
  (21, 1)	1
  (22, 3)	1
  (23, 0)	1
  (24, 0)	1
  :	:
  (75, 2)	1
  (76, 2)	1
  (77, 2)	1
  (78, 2)	1
  (79, 1)	1
  (80, 0)	1
  (81, 2)	1
  (82, 2)	1
  (83, 0)	1
  (84, 2)	1
  (85, 1)	1
  (86, 2)	1
  (87, 3)	1
  (88, 0)	1
  (89, 1)	1
  (90, 0)	1
  (91, 2)	1
  (92, 0)	1
  (93, 2)	1
  (94, 1)	1
  (95, 2)	1
  (96, 2)	1
  (97, 1)	1
  (98, 0)	1
  (99, 2)	1
In [176]:
teacherPrefixes = [prefix.replace('.', '') for prefix in trainingData['teacher_prefix'].values];
teacherPrefixes[0:5]
Out[176]:
['Ms', 'Mr', 'Mrs', 'Mrs', 'Ms']
In [177]:
trainingData['teacher_prefix'] = teacherPrefixes;
trainingData.head(3)
Out[177]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity preprocessed_titles preprocessed_essays number_of_words_in_title number_of_words_in_essay positive_sentiment_score negative_sentiment_score neutral_sentiment_score compound_sentiment_score
62811 80292 p139019 83c39c23b042155caae7cd1421c0b13d Ms FL 2016-08-04 13:44:24 GradesPreKto2 Literacy & Language Literacy Books Can Take You Places The students in my class are primarily of Hait... I always tell my students that reading is the ... NaN NaN My students need a text rich environment! The... 0 0 Literacy_Language Literacy The students in my class are primarily of Hait... 498.93 28 books take places students class primarily haitian guatemalan de... 3 128 0.154 0.087 0.759 0.9019
16391 8304 p035341 ebf2e9a430ed029bfd97b1405661eb3c Mr MA 2017-02-02 17:07:05 Grades9to12 Math & Science, Music & The Arts Applied Sciences, Visual Arts 3Doodler 3D Drawing Technology \r\nSTEAM(ing) ... We are an Alternative High School based on phi... Our students will use 3Doodler 3D drawing pens... NaN NaN My students need a 3Doodler 3D pen EDU Half Bu... 21 1 Math_Science Music_Arts AppliedSciences VisualArts We are an Alternative High School based on phi... 599.00 1 3doodler 3d drawing technology steam ing towar... alternative high school based philosophies tea... 8 144 0.272 0.042 0.685 0.9901
56138 115945 p118782 3a00fc48214a28d533c05a4bf0e1c2ff Mrs AR 2016-06-27 18:16:55 Grades3to5 Math & Science, Literacy & Language Environmental Science, Literacy Sizzling Science Materials Our school serves students in grades K-3 and p... My students will use these hands-on science ac... NaN NaN My students need hands-on science materials li... 5 0 Math_Science Literacy_Language EnvironmentalScience Literacy Our school serves students in grades K-3 and p... 357.62 12 sizzling science materials school serves students grades k 3 provides rig... 3 111 0.217 0.134 0.650 0.9088
In [0]:
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
In [179]:
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
Features used in vectorizing teacher_prefix: 
======================================================================
['Ms', 'Mr', 'Mrs', 'Teacher', 'Dr']
======================================================================
Shape of teacher_prefix matrix after vectorization(one-hot-encoding):  (90848, 5)
======================================================================
Sample vectors of teacher_prefix: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 2)	1

4. Vectorizing school_state - One Hot Encoding

In [0]:
schoolStateDictionary = dict(giveCounter(trainingData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(trainingData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(trainingData['school_state'].values);
In [181]:
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
Features used in vectorizing school_state: 
======================================================================
['FL', 'MA', 'AR', 'NY', 'OR', 'AZ', 'MI', 'CA', 'SC', 'MN', 'VA', 'NC', 'GA', 'NH', 'UT', 'AK', 'CO', 'DC', 'TX', 'TN', 'IL', 'IN', 'IA', 'WI', 'WA', 'HI', 'MD', 'MO', 'NJ', 'OH', 'ID', 'LA', 'KS', 'RI', 'PA', 'OK', 'NV', 'KY', 'AL', 'DE', 'WY', 'WV', 'CT', 'ME', 'MS', 'NE', 'NM', 'MT', 'ND', 'SD', 'VT']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (90848, 51)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 3)	1

5. Vectorizing project_grade_category - One Hot Encoding

In [182]:
giveCounter(trainingData['project_grade_category'])
Out[182]:
Counter({'Grades3to5': 30465,
         'Grades6to8': 14131,
         'Grades9to12': 9136,
         'GradesPreKto2': 37116})
In [183]:
cleanedGrades = []
for grade in trainingData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
Out[183]:
['GradesPreKto2', 'Grades9to12', 'Grades3to5', 'Grades3to5']
In [184]:
trainingData['project_grade_category'] = cleanedGrades
trainingData.head(4)
Out[184]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved cleaned_categories cleaned_sub_categories project_essay price quantity preprocessed_titles preprocessed_essays number_of_words_in_title number_of_words_in_essay positive_sentiment_score negative_sentiment_score neutral_sentiment_score compound_sentiment_score
62811 80292 p139019 83c39c23b042155caae7cd1421c0b13d Ms FL 2016-08-04 13:44:24 GradesPreKto2 Literacy & Language Literacy Books Can Take You Places The students in my class are primarily of Hait... I always tell my students that reading is the ... NaN NaN My students need a text rich environment! The... 0 0 Literacy_Language Literacy The students in my class are primarily of Hait... 498.93 28 books take places students class primarily haitian guatemalan de... 3 128 0.154 0.087 0.759 0.9019
16391 8304 p035341 ebf2e9a430ed029bfd97b1405661eb3c Mr MA 2017-02-02 17:07:05 Grades9to12 Math & Science, Music & The Arts Applied Sciences, Visual Arts 3Doodler 3D Drawing Technology \r\nSTEAM(ing) ... We are an Alternative High School based on phi... Our students will use 3Doodler 3D drawing pens... NaN NaN My students need a 3Doodler 3D pen EDU Half Bu... 21 1 Math_Science Music_Arts AppliedSciences VisualArts We are an Alternative High School based on phi... 599.00 1 3doodler 3d drawing technology steam ing towar... alternative high school based philosophies tea... 8 144 0.272 0.042 0.685 0.9901
56138 115945 p118782 3a00fc48214a28d533c05a4bf0e1c2ff Mrs AR 2016-06-27 18:16:55 Grades3to5 Math & Science, Literacy & Language Environmental Science, Literacy Sizzling Science Materials Our school serves students in grades K-3 and p... My students will use these hands-on science ac... NaN NaN My students need hands-on science materials li... 5 0 Math_Science Literacy_Language EnvironmentalScience Literacy Our school serves students in grades K-3 and p... 357.62 12 sizzling science materials school serves students grades k 3 provides rig... 3 111 0.217 0.134 0.650 0.9088
15015 67654 p005945 305ebf046b828997bde45a7ba7a42ccd Mrs NY 2016-11-08 07:56:12 Grades3to5 Music & The Arts Visual Arts We have broken crayons... help!\r\n My students are the most grateful people I hav... Our students need some new crayons in art clas... NaN NaN My students need some new crayons. Our crayon... 52 1 Music_Arts VisualArts My students are the most grateful people I hav... 18.99 16 broken crayons help students grateful people ever met simplest thi... 3 94 0.344 0.025 0.631 0.9887
In [0]:
projectGradeDictionary = dict(giveCounter(trainingData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(trainingData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(trainingData['project_grade_category'].values);
In [186]:
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
Features used in vectorizing project_grade_category: 
======================================================================
['GradesPreKto2', 'Grades9to12', 'Grades3to5', 'Grades6to8']
======================================================================
Shape of school_state matrix after vectorization(one-hot-encoding):  (90848, 4)
======================================================================
Sample vectors of school_state: 
======================================================================
  (0, 0)	1
  (1, 1)	1
  (2, 2)	1
  (3, 2)	1

Tf-Idf Vectorization

1. Vectorizing project_essay

In [0]:
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10, max_features = 5000);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
In [188]:
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
Some of the Features used in tf-idf vectorizing preprocessed essays: 
======================================================================
['worrying', 'worst', 'worth', 'worthwhile', 'worthy', 'would', 'wow', 'write', 'writer', 'writers', 'writing', 'writings', 'written', 'wrong', 'wrote', 'xylophones', 'yard', 'year', 'yearbook', 'yearly', 'yearn', 'yearning', 'years', 'yes', 'yesterday', 'yet', 'yoga', 'york', 'young', 'younger', 'youngest', 'youngsters', 'youth', 'youtube', 'zero', 'zest', 'zip', 'zone', 'zones', 'zoo']
======================================================================
Shape of preprocessed title matrix after tf-idf vectorization:  (109248, 5000)
======================================================================
Sample Tf-Idf vector of preprocessed essay: 
======================================================================
  (0, 3013)	0.015965240695453155
  (0, 1488)	0.10227077629951559
  (0, 900)	0.026463005286219803
  (0, 4982)	0.04582647393654424
  (0, 3375)	0.0625444219876457
  (0, 4977)	0.0306752753684296
  (0, 4752)	0.05440679396599839
  (0, 780)	0.07662037632839458
  (0, 3169)	0.03781481331251044
  (0, 3390)	0.18321040329163196
  (0, 108)	0.040000071711429254
  (0, 3091)	0.022124698215231303
  (0, 1433)	0.061920444892322624
  (0, 1246)	0.05054577035223512
  (0, 4851)	0.06986541769014476
  (0, 3730)	0.09018028989080147
  (0, 4021)	0.09616413965528452
  (0, 4479)	0.03754788787263396
  (0, 805)	0.07832945191677794
  (0, 4224)	0.10949652575968567
  (0, 4795)	0.20453745972832738
  (0, 1468)	0.10442911392080413
  (0, 3256)	0.045907362380733514
  (0, 3137)	0.08665307470908791
  (0, 4290)	0.0668235584225867
  :	:
  (0, 2819)	0.18267578272530896
  (0, 2657)	0.08246396864201584
  (0, 1645)	0.03290277997628051
  (0, 3540)	0.09266878292993941
  (0, 2629)	0.23032992958601448
  (0, 3788)	0.18439470661000118
  (0, 31)	0.08003880477371976
  (0, 3958)	0.03546008600799206
  (0, 2591)	0.1200129277454165
  (0, 2032)	0.08302595052382315
  (0, 622)	0.0797021927845829
  (0, 262)	0.09317692266510744
  (0, 579)	0.08993661029788748
  (0, 3019)	0.07694246325409376
  (0, 2316)	0.09044713807559518
  (0, 3723)	0.09806756817480866
  (0, 3439)	0.09819479507450754
  (0, 2861)	0.09995313270720342
  (0, 2592)	0.13131596546189067
  (0, 4546)	0.058915439503183835
  (0, 3986)	0.04932114628372647
  (0, 4944)	0.03804356418624494
  (0, 2630)	0.03600810586474103
  (0, 1564)	0.29767755886622843
  (0, 4364)	0.07692419628496143

Vectorizing numerical features

1. Vectorizing price

In [0]:
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = MinMaxScaler();
priceScaler.fit(trainingData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(trainingData['price'].values.reshape(-1, 1));
In [190]:
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(trainingData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
Shape of standardized matrix of prices:  (90848, 1)
======================================================================
Sample original prices: 
======================================================================
[498.93 599.   357.62  18.99 117.55]
Sample standardized prices: 
======================================================================
[[0.04983527]
 [0.05984393]
 [0.03570193]
 [0.0018333 ]
 [0.01169094]]

2. Vectorizing quantity

In [0]:
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = MinMaxScaler();
quantityScaler.fit(trainingData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(trainingData['quantity'].values.reshape(-1, 1));
In [192]:
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
Shape of standardized matrix of quantities:  (90848, 1)
======================================================================
Sample original quantities: 
======================================================================
[28  1 12 16 41]
Sample standardized quantities: 
======================================================================
[[0.02906351]
 [0.        ]
 [0.01184069]
 [0.01614639]
 [0.04305705]]

3. Vectorizing teacher_number_of_previously_posted_projects

In [0]:
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = MinMaxScaler();
previouslyPostedScaler.fit(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
In [194]:
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original number of previously posted projects: ");
equalsBorder(70);
print(trainingData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
Shape of standardized matrix of teacher_number_of_previously_posted_projects:  (90848, 1)
======================================================================
Sample original number of previously posted projects: 
======================================================================
[ 0 21  5 52  0]
Sample standardized teacher_number_of_previously_posted_projects: 
======================================================================
[[0.        ]
 [0.04656319]
 [0.01108647]
 [0.11529933]
 [0.        ]]

4. Vectorizing number_of_words_in_title

In [0]:
numberOfWordsInTitleScaler = MinMaxScaler();
numberOfWordsInTitleScaler.fit(trainingData['number_of_words_in_title'].values.reshape(-1, 1));
numberOfWordsInTitleStandardized = numberOfWordsInTitleScaler.transform(trainingData['number_of_words_in_title'].values.reshape(-1, 1));
In [196]:
print("Shape of standardized matrix of number_of_words_in_title: ", numberOfWordsInTitleStandardized.shape);
equalsBorder(70);
print("Sample original number of words in title: ");
equalsBorder(70);
print(trainingData['number_of_words_in_title'].values[0:5]);
print("Sample standardized number_of_words_in_title: ");
equalsBorder(70);
print(numberOfWordsInTitleStandardized[0:5]);
Shape of standardized matrix of number_of_words_in_title:  (90848, 1)
======================================================================
Sample original number of words in title: 
======================================================================
[3 8 3 3 5]
Sample standardized number_of_words_in_title: 
======================================================================
[[0.27272727]
 [0.72727273]
 [0.27272727]
 [0.27272727]
 [0.45454545]]

5. Vectorizing number_of_words_in_essay

In [0]:
numberOfWordsInEssayScaler = MinMaxScaler();
numberOfWordsInEssayScaler.fit(trainingData['number_of_words_in_essay'].values.reshape(-1, 1));
numberOfWordsInEssayStandardized = numberOfWordsInEssayScaler.transform(trainingData['number_of_words_in_essay'].values.reshape(-1, 1));
In [198]:
print("Shape of standardized matrix of number_of_words_in_essay: ", numberOfWordsInEssayStandardized.shape);
equalsBorder(70);
print("Sample original number of words in essay: ");
equalsBorder(70);
print(trainingData['number_of_words_in_essay'].values[0:5]);
print("Sample standardized number_of_words_in_essay: ");
equalsBorder(70);
print(numberOfWordsInEssayStandardized[0:5]);
Shape of standardized matrix of number_of_words_in_essay:  (90848, 1)
======================================================================
Sample original number of words in essay: 
======================================================================
[128 144 111  94 114]
Sample standardized number_of_words_in_essay: 
======================================================================
[[0.23868313]
 [0.30452675]
 [0.16872428]
 [0.09876543]
 [0.18106996]]
In [0]:
numberOfPoints = previouslyPostedStandardized.shape[0];
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];

# Text data
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];

# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
numberOfWordsInTitleStandardizedSub = numberOfWordsInTitleStandardized[0:numberOfPoints];
numberOfWordsInEssayStandardizedSub = numberOfWordsInEssayStandardized[0:numberOfPoints];
positiveSentimentScoreSub = trainingData['positive_sentiment_score'].values[0:numberOfPoints].reshape(-1, 1);
negativeSentimentScoreSub = trainingData['negative_sentiment_score'].values[0:numberOfPoints].reshape(-1, 1);
neutralSentimentScoreSub = trainingData['neutral_sentiment_score'].values[0:numberOfPoints].reshape(-1, 1);
compoundSentimentScoreSub = trainingData['compound_sentiment_score'].values[0:numberOfPoints].reshape(-1, 1);

# Classes
classesTrainingSub = classesTraining;
In [5]:
supportVectorMachineResultsDataFrame = pd.DataFrame(columns =  ['Vectorizer', 'Model', 'Hyper Parameter - alpha', 'AUC', 'Data']);
supportVectorMachineResultsDataFrame
Out[5]:
Vectorizer Model Hyper Parameter - alpha AUC Data

Preparing cross validate data for analysis

In [201]:
# Test data categorical features transformation 
categoriesTransformedCrossValidateData = subjectsCategoriesVectorizer.transform(crossValidateData['cleaned_categories']);
subCategoriesTransformedCrossValidateData = subjectsSubCategoriesVectorizer.transform(crossValidateData['cleaned_sub_categories']);
teacherPrefixTransformedCrossValidateData = teacherPrefixVectorizer.transform(crossValidateData['teacher_prefix']);
schoolStateTransformedCrossValidateData = schoolStateVectorizer.transform(crossValidateData['school_state']);
projectGradeTransformedCrossValidateData = projectGradeVectorizer.transform(crossValidateData['project_grade_category']);

# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_essay'])[1];
tfIdfEssayTransformedCrossValidateData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);

# Test data numerical features transformation
priceTransformedCrossValidateData = priceScaler.transform(crossValidateData['price'].values.reshape(-1, 1));
quantityTransformedCrossValidateData = quantityScaler.transform(crossValidateData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedCrossValidateData = previouslyPostedScaler.transform(crossValidateData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
numberOfWordsInTitleTransformedCrossValidateData = numberOfWordsInTitleScaler.transform(crossValidateData['number_of_words_in_title'].values.reshape(-1, 1));
numberOfWordsInEssayTransformedCrossValidateData = numberOfWordsInEssayScaler.transform(crossValidateData['number_of_words_in_essay'].values.reshape(-1, 1));
positiveSentimentScoreCrossValidateData = crossValidateData['positive_sentiment_score'].values.reshape(-1, 1);
negativeSentimentScoreCrossValidateData = crossValidateData['negative_sentiment_score'].values.reshape(-1, 1);
neutralSentimentScoreCrossValidateData = crossValidateData['neutral_sentiment_score'].values.reshape(-1, 1);
compoundSentimentScoreCrossValidateData = crossValidateData['compound_sentiment_score'].values.reshape(-1, 1);

Preparing Test data for analysis

In [202]:
# Test data categorical features transformation 
categoriesTransformedTestData = subjectsCategoriesVectorizer.transform(testData['cleaned_categories']);
subCategoriesTransformedTestData = subjectsSubCategoriesVectorizer.transform(testData['cleaned_sub_categories']);
teacherPrefixTransformedTestData = teacherPrefixVectorizer.transform(testData['teacher_prefix']);
schoolStateTransformedTestData = schoolStateVectorizer.transform(testData['school_state']);
projectGradeTransformedTestData = projectGradeVectorizer.transform(testData['project_grade_category']);

# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(testData['project_essay'])[1];
tfIdfEssayTransformedTestData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);


# Test data numerical features transformation
priceTransformedTestData = priceScaler.transform(testData['price'].values.reshape(-1, 1));
quantityTransformedTestData = quantityScaler.transform(testData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedTestData = previouslyPostedScaler.transform(testData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
numberOfWordsInTitleTransformedTestData = numberOfWordsInTitleScaler.transform(testData['number_of_words_in_title'].values.reshape(-1, 1));
numberOfWordsInEssayTransformedTestData = numberOfWordsInEssayScaler.transform(testData['number_of_words_in_essay'].values.reshape(-1, 1));
positiveSentimentScoreTestData = testData['positive_sentiment_score'].values.reshape(-1, 1);
negativeSentimentScoreTestData = testData['negative_sentiment_score'].values.reshape(-1, 1);
neutralSentimentScoreTestData = testData['neutral_sentiment_score'].values.reshape(-1, 1);
compoundSentimentScoreTestData = testData['compound_sentiment_score'].values.reshape(-1, 1);

Finding appropriate dimensions(less) using elbow method

In [0]:
trainingMergedData = hstack((categoriesVectorsSub,\
                             subCategoriesVectorsSub,\
                             teacherPrefixVectorsSub,\
                             schoolStateVectorsSub,\
                             projectGradeVectorsSub,\
                             priceStandardizedSub,\
                             previouslyPostedStandardizedSub,\
                             numberOfWordsInTitleStandardizedSub,\
                             numberOfWordsInEssayStandardizedSub,\
                             positiveSentimentScoreSub,\
                             negativeSentimentScoreSub,\
                             neutralSentimentScoreSub,\
                             compoundSentimentScoreSub, \
                            tfIdfEssayModelSub));
svd = TruncatedSVD(n_components = trainingMergedData.shape[1] - 1, random_state = 42);
svd.fit(trainingMergedData); 

componentsRatio = svd.explained_variance_ratio_;
In [0]:
components = np.arange(1, trainingMergedData.shape[1]);
componentsRatio =svd.explained_variance_ratio_.cumsum()

print(componentsRatio);

plt.xlabel('Number of components');
plt.ylabel('Variance');
plt.plot(components, componentsRatio, color = 'green');
[0.01792355 0.10202375 0.17900866 ... 1.         1.         1.        ]

Observations:

  1. As you can see from above plot that with dimensions more than 450 90% of variance is retained. So the less number of dimensions to start with inorder to get good results would be 450.
  2. At dimensions more than 1400 more than 95% of variance is retained.

Classification using imbalanced data with reduced dimensions by support vector machine

In [166]:
techniques = ['With Reduced Dimensions'];
n_componentsValues = [450, 600, 900, 1200, 1400];
for index, technique in enumerate(techniques):
  for n_components in n_componentsValues:
    trainingMergedData = hstack((categoriesVectorsSub,\
                             subCategoriesVectorsSub,\
                             teacherPrefixVectorsSub,\
                             schoolStateVectorsSub,\
                             projectGradeVectorsSub,\
                             priceStandardizedSub,\
                             previouslyPostedStandardizedSub,\
                             numberOfWordsInTitleStandardizedSub,\
                             numberOfWordsInEssayStandardizedSub,\
                             positiveSentimentScoreSub,\
                             negativeSentimentScoreSub,\
                             neutralSentimentScoreSub,\
                             compoundSentimentScoreSub,\
                             tfIdfEssayModelSub));
    svd = TruncatedSVD(n_components = n_components, random_state = 42);
    svd.fit(trainingMergedData);
    trainingMergedData = svd.transform(trainingMergedData);
    crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
                                      subCategoriesTransformedCrossValidateData,\
                                      teacherPrefixTransformedCrossValidateData,\
                                      schoolStateTransformedCrossValidateData,\
                                      projectGradeTransformedCrossValidateData,\
                                      priceTransformedCrossValidateData,\
                                      previouslyPostedTransformedCrossValidateData,\
                                      numberOfWordsInTitleTransformedCrossValidateData,\
                                      numberOfWordsInEssayTransformedCrossValidateData,\
                                      positiveSentimentScoreCrossValidateData,\
                                      negativeSentimentScoreCrossValidateData,\
                                      neutralSentimentScoreCrossValidateData,\
                                      compoundSentimentScoreCrossValidateData,\
                                      tfIdfEssayTransformedCrossValidateData));
    crossValidateMergedData = svd.transform(crossValidateMergedData);

    testMergedData = hstack((categoriesTransformedTestData,\
                             subCategoriesTransformedTestData,\
                             teacherPrefixTransformedTestData,\
                             schoolStateTransformedTestData,\
                             projectGradeTransformedTestData,\
                             priceTransformedTestData,\
                             previouslyPostedTransformedTestData,\
                             numberOfWordsInTitleTransformedTestData,\
                             numberOfWordsInEssayTransformedTestData,\
                             positiveSentimentScoreTestData,\
                             negativeSentimentScoreTestData,\
                             neutralSentimentScoreTestData,\
                             compoundSentimentScoreTestData,\
                            tfIdfEssayTransformedTestData));
    testMergedData = svd.transform(testMergedData);

    svmClassifier = linear_model.SGDClassifier(loss = 'hinge');
    tunedParameters = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
    classifier = GridSearchCV(svmClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
    classifier.fit(trainingMergedData, classesTrainingSub);

    crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
    crossValidateAucStdValues = classifier.cv_results_['std_test_score'];

    plt.plot(tunedParameters['alpha'], crossValidateAucMeanValues, label = "Cross Validate AUC");
    plt.scatter(tunedParameters['alpha'], crossValidateAucMeanValues, label = ['Cross validate AUC values']);
    plt.gca().fill_between(tunedParameters['alpha'], crossValidateAucMeanValues - crossValidateAucStdValues, crossValidateAucMeanValues + crossValidateAucStdValues, alpha = 0.2, color = 'darkorange');
    plt.xlabel('Hyper parameter: alpha values');
    plt.ylabel('Scoring: AUC values');
    plt.grid();
    plt.legend();
    plt.show();

    optimalHypParamValue = classifier.best_params_['alpha'];
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge', alpha = optimalHypParamValue);
    svmClassifier.fit(trainingMergedData, classesTrainingSub);
    predScoresTraining = svmClassifier.predict(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining);
    predScoresTest = svmClassifier.predict(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest);

    plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
    plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.xlabel("fpr values");
    plt.ylabel("tpr values");
    plt.grid();
    plt.legend();
    plt.show();

    areaUnderRocValueTest = auc(fprTest, tprTest);

    print("Results of analysis with dimensions({}) merged with tf-idf essays using support vector machine classifier: ".format(n_components, technique));
    equalsBorder(70);
    print("Optimal Hyper parameter Value: ", optimalHypParamValue);
    equalsBorder(40);
    print("AUC value of test data: ", str(areaUnderRocValueTest));
    # Predicting classes of test data projects
    predictionClassesTest = svmClassifier.predict(testMergedData);
    equalsBorder(40);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="YlGnBu");
    plt.show();
    # Adding results to results dataframe
    supportVectorMachineResultsDataFrame = supportVectorMachineResultsDataFrame.append({'Vectorizer': technique, 'Model': 'SVM(SGD - hinge loss)', 'Hyper Parameter - alpha': optimalHypParamValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
Results of analysis with dimensions(450) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  1
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(600) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  1
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(900) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  1
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(1200) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(1400) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.01
========================================
AUC value of test data:  0.5
========================================
Confusion Matrix : 
============================================================

Classification using support vector machine with balanced data of reduced dimensions(no title feature)

In [203]:
techniques = ['With Reduced Dimensions'];
n_componentsValues = [450, 600, 900, 1200, 1400];
for index, technique in enumerate(techniques):
  for n_components in n_componentsValues:
    trainingMergedData = hstack((categoriesVectorsSub,\
                             subCategoriesVectorsSub,\
                             teacherPrefixVectorsSub,\
                             schoolStateVectorsSub,\
                             projectGradeVectorsSub,\
                             priceStandardizedSub,\
                             previouslyPostedStandardizedSub,\
                             numberOfWordsInTitleStandardizedSub,\
                             numberOfWordsInEssayStandardizedSub,\
                             positiveSentimentScoreSub,\
                             negativeSentimentScoreSub,\
                             neutralSentimentScoreSub,\
                             compoundSentimentScoreSub,\
                             tfIdfEssayModelSub));
    svd = TruncatedSVD(n_components = n_components, random_state = 42);
    svd.fit(trainingMergedData);
    trainingMergedData = svd.transform(trainingMergedData);
    crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
                                      subCategoriesTransformedCrossValidateData,\
                                      teacherPrefixTransformedCrossValidateData,\
                                      schoolStateTransformedCrossValidateData,\
                                      projectGradeTransformedCrossValidateData,\
                                      priceTransformedCrossValidateData,\
                                      previouslyPostedTransformedCrossValidateData,\
                                      numberOfWordsInTitleTransformedCrossValidateData,\
                                      numberOfWordsInEssayTransformedCrossValidateData,\
                                      positiveSentimentScoreCrossValidateData,\
                                      negativeSentimentScoreCrossValidateData,\
                                      neutralSentimentScoreCrossValidateData,\
                                      compoundSentimentScoreCrossValidateData,\
                                      tfIdfEssayTransformedCrossValidateData));
    crossValidateMergedData = svd.transform(crossValidateMergedData);

    testMergedData = hstack((categoriesTransformedTestData,\
                             subCategoriesTransformedTestData,\
                             teacherPrefixTransformedTestData,\
                             schoolStateTransformedTestData,\
                             projectGradeTransformedTestData,\
                             priceTransformedTestData,\
                             previouslyPostedTransformedTestData,\
                             numberOfWordsInTitleTransformedTestData,\
                             numberOfWordsInEssayTransformedTestData,\
                             positiveSentimentScoreTestData,\
                             negativeSentimentScoreTestData,\
                             neutralSentimentScoreTestData,\
                             compoundSentimentScoreTestData,\
                            tfIdfEssayTransformedTestData));
    testMergedData = svd.transform(testMergedData);

    svmClassifier = linear_model.SGDClassifier(loss = 'hinge');
    tunedParameters = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
    classifier = GridSearchCV(svmClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
    classifier.fit(trainingMergedData, classesTrainingSub);

    crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
    crossValidateAucStdValues = classifier.cv_results_['std_test_score'];

    plt.plot(tunedParameters['alpha'], crossValidateAucMeanValues, label = "Cross Validate AUC");
    plt.scatter(tunedParameters['alpha'], crossValidateAucMeanValues, label = ['Cross validate AUC values']);
    plt.gca().fill_between(tunedParameters['alpha'], crossValidateAucMeanValues - crossValidateAucStdValues, crossValidateAucMeanValues + crossValidateAucStdValues, alpha = 0.2, color = 'darkorange');
    plt.xlabel('Hyper parameter: alpha values');
    plt.ylabel('Scoring: AUC values');
    plt.grid();
    plt.legend();
    plt.show();

    optimalHypParamValue = classifier.best_params_['alpha'];
    svmClassifier = linear_model.SGDClassifier(loss = 'hinge', alpha = optimalHypParamValue);
    svmClassifier.fit(trainingMergedData, classesTrainingSub);
    predScoresTraining = svmClassifier.predict(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining);
    predScoresTest = svmClassifier.predict(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest);

    plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
    plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.xlabel("fpr values");
    plt.ylabel("tpr values");
    plt.grid();
    plt.legend();
    plt.show();

    areaUnderRocValueTest = auc(fprTest, tprTest);

    print("Results of analysis with dimensions({}) merged with tf-idf essays using support vector machine classifier: ".format(n_components, technique));
    equalsBorder(70);
    print("Optimal Hyper parameter Value: ", optimalHypParamValue);
    equalsBorder(40);
    print("AUC value of test data: ", str(areaUnderRocValueTest));
    # Predicting classes of test data projects
    predictionClassesTest = svmClassifier.predict(testMergedData);
    equalsBorder(40);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="YlGnBu");
    plt.show();
    # Adding results to results dataframe
    supportVectorMachineResultsDataFrame = supportVectorMachineResultsDataFrame.append({'Vectorizer': technique, 'Model': 'SVM(SGD - hinge loss)', 'Hyper Parameter - alpha': optimalHypParamValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
Results of analysis with dimensions(450) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5741491821761427
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(600) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5902067192517861
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(900) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.587360106874942
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(1200) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5897170904613597
========================================
Confusion Matrix : 
============================================================
Results of analysis with dimensions(1400) merged with tf-idf essays using support vector machine classifier: 
======================================================================
Optimal Hyper parameter Value:  0.0001
========================================
AUC value of test data:  0.5894884161420233
========================================
Confusion Matrix : 
============================================================

Summary of results of above classification using support vector machine

In [8]:
supportVectorMachineResultsDataFrame
Out[8]:
Vectorizer Model Hyper Parameter - alpha AUC Data
0 Bag of words SVM(SGD - hinge loss) 0.0100 0.5000 Imbalanced
1 Tf-Idf SVM(SGD - hinge loss) 0.0001 0.5049 Imbalanced
2 Average Word2Vector SVM(SGD - hinge loss) 0.0001 0.5822 Imbalanced
3 Tf-Idf Weighted Word2Vector SVM(SGD - hinge loss) 0.0001 0.5821 Imbalanced
4 Bag of words SVM(SGD - hinge loss) 0.0100 0.6599 Balanced
5 Tf-Idf SVM(SGD - hinge loss) 0.0001 0.6441 Balanced
6 Average Word2Vector SVM(SGD - hinge loss) 0.0001 0.5968 Balanced
7 Tf-Idf Weighted Word2Vector SVM(SGD - hinge loss) 0.0100 0.6249 Balanced
8 Tf-Idf(dimensions - 450) SVM(SGD - hinge loss) 1.0000 0.5000 ImBalanced(no title feature)
9 Tf-Idf(dimensions - 600) SVM(SGD - hinge loss) 1.0000 0.5000 ImBalanced(no title feature)
10 Tf-Idf(dimensions - 900) SVM(SGD - hinge loss) 1.0000 0.5000 ImBalanced(no title feature)
11 Tf-Idf(dimensions - 1200) SVM(SGD - hinge loss) 0.0001 0.5000 ImBalanced(no title feature)
12 Tf-Idf(dimensions - 1400) SVM(SGD - hinge loss) 0.0100 0.5000 ImBalanced(no title feature)
13 Tf-Idf(dimensions - 450) SVM(SGD - hinge loss) 0.0001 0.5741 Balanced(no title feature)
14 Tf-Idf(dimensions - 600) SVM(SGD - hinge loss) 0.0001 0.5902 Balanced(no title feature)
15 Tf-Idf(dimensions - 900) SVM(SGD - hinge loss) 0.0001 0.5873 Balanced(no title feature)
16 Tf-Idf(dimensions - 1200) SVM(SGD - hinge loss) 0.0001 0.5897 Balanced(no title feature)
17 Tf-Idf(dimensions - 1400) SVM(SGD - hinge loss) 0.0001 0.5894 Balanced(no title feature)

Conclusions of above analysis

  1. From above analysis it seems that when data is balanced and text features are vectorized using bag of words the support vector machine is giving best results with auc value of 0.6599. The classification of negative points into negative points are also reasonable with this model but with other models it is kind of biased.
  2. When classification is done using imbalanced data with reduced dimensions it is giving auc value of 0.5 totally biased model(dumb model) that is classifying every point as positive. When classification is done after balancing, models seems better.
  3. At last the good combination would be using balanced data with all categorical features, numerical features and text features vectorized with bag of words technique and hyper parameter value as 0.1.